library(tidyverse)
library(magrittr)
library(scales)
library(hexbin)
data <- read_csv("prosperLoanData.csv") %>% mutate_if(is.character, as.factor)
Above, I imported the character columns as factors: on closer inspection, they are category labels rather than free-form strings, and nothing in the analysis below contradicts this. The first thing I will do now is take a closer look at the data and check whether the other columns are formatted appropriately:
data[,1:7]
## # A tibble: 113,937 x 7
## ListingKey ListingNumber ListingCreationDate CreditGrade Term
## <fct> <int> <dttm> <fct> <int>
## 1 102133976686814541… 193129 2007-08-26 19:09:29 C 36
## 2 10273602499503308B… 1209647 2014-02-27 08:28:07 <NA> 36
## 3 0EE933782585103286… 81716 2007-01-05 15:00:47 HR 36
## 4 0EF535600248271529… 658116 2012-10-22 11:02:35 <NA> 36
## 5 0F023589499656230C… 909464 2013-09-14 18:38:39 <NA> 36
## 6 0F0535973482419938… 1074836 2013-12-14 08:26:37 <NA> 60
## 7 0F0A3576754255009D… 750899 2013-04-12 09:52:56 <NA> 36
## 8 0F1035772717087366… 768193 2013-05-05 06:49:27 <NA> 36
## 9 0F043596202561788E… 1023355 2013-12-02 10:43:39 <NA> 36
## 10 0F043596202561788E… 1023355 2013-12-02 10:43:39 <NA> 36
## # ... with 113,927 more rows, and 2 more variables: LoanStatus <fct>,
## # ClosedDate <dttm>
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : POSIXct, format: "2007-08-26 19:09:29" "2014-02-27 08:28:07" ...
## $ CreditGrade : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : POSIXct, format: "2009-08-14" NA ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating (numeric) : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating (Alpha) : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory (numeric) : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
## $ Occupation : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
## $ DateCreditPulled : POSIXct, format: "2007-08-26 18:41:46" "2014-02-27 08:28:14" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : POSIXct, format: "2001-10-11" "1996-03-18" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent (percentage) : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : POSIXct, format: "2007-09-12" "2014-03-03" ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
The first thing I notice is that several date columns are stored as date-times and should be converted to dates, and that several boolean (True/False) columns are stored as factors. I also want to order the levels in some of the factor columns, as they are inherently ordered (CreditGrade, ProsperRating.alpha, IncomeRange, LoanOriginationQuarter). Several column names contain spaces or special characters, which makes them awkward to refer to, so I will rename them.
data %<>%
mutate_at(c("ListingCreationDate","ClosedDate","DateCreditPulled","FirstRecordedCreditLine","LoanOriginationDate"), as.Date) %>%
mutate_at(c("IsBorrowerHomeowner","CurrentlyInGroup","IncomeVerifiable"), as.logical) %>%
rename_all(~sub(" (numeric)", ".num", ., fixed=TRUE)) %>%
rename_all(~sub(" (Alpha)", ".alpha", ., fixed=TRUE)) %>%
rename_all(~sub(" (percentage)", ".per", ., fixed=TRUE))
data$CreditGrade <- ordered(data$CreditGrade, c("NC","HR","E","D","C","B","A","AA"))
data$ProsperRating.alpha <- ordered(data$ProsperRating.alpha, c("NC","HR","E","D","C","B","A","AA"))
data$IncomeRange <- ordered(data$IncomeRange, c("Not displayed","Not employed","$0","$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
# Quarter levels in chronological order: "Q1 2006", "Q2 2006", ..., "Q4 2014"
data$LoanOriginationQuarter <- ordered(data$LoanOriginationQuarter, paste0("Q", 1:4, " ", rep(2006:2014, each = 4)))
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Date, format: "2007-08-26" "2014-02-27" ...
## $ CreditGrade : Ord.factor w/ 8 levels "NC"<"HR"<"E"<..: 5 NA 2 NA NA NA NA NA NA NA ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Date, format: "2009-08-14" NA ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating.num : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating.alpha : Ord.factor w/ 8 levels "NC"<"HR"<"E"<..: NA 7 NA 7 4 6 3 5 8 8 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory.num : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
## $ Occupation : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : logi TRUE FALSE FALSE TRUE TRUE TRUE ...
## $ CurrentlyInGroup : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ GroupKey : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
## $ DateCreditPulled : Date, format: "2007-08-26" "2014-02-27" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Date, format: "2001-10-11" "1996-03-18" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent.per : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Ord.factor w/ 8 levels "Not displayed"<..: 5 6 1 5 8 8 5 5 5 5 ...
## $ IncomeVerifiable : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Date, format: "2007-09-12" "2014-03-03" ...
## $ LoanOriginationQuarter : Ord.factor w/ 36 levels "Q1 2006"<"Q2 2006"<..: 7 33 5 28 31 32 30 30 32 32 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
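One caveat with `ordered()` is that any value not found in the supplied level vector is silently turned into NA. The summary further down reports 22 NA values in LoanOriginationQuarter, and the earliest LoanOriginationDate is 2005-11-15, which falls before the first supplied level ("Q1 2006"), so it is possible that a "Q4 2005" level was dropped here. The following is a minimal sketch of how such a loss could be detected; it uses a small toy factor (an assumption for illustration, not the real column):

```r
# Toy stand-in for LoanOriginationQuarter (hypothetical values, not the real data)
quarters <- factor(c("Q4 2005", "Q1 2006", "Q3 2007", "Q1 2006"))

# Level vector that starts at Q1 2006, so "Q4 2005" is not covered
manual_levels <- paste0("Q", 1:4, " ", rep(2006:2007, each = 4))
releveled <- ordered(quarters, levels = manual_levels)

# Levels that existed before but are absent from the supplied vector
lost <- setdiff(levels(quarters), manual_levels)

# Number of values that were present before and are NA afterwards
n_new_na <- sum(is.na(releveled) & !is.na(quarters))

print(lost)
print(n_new_na)
```

Running the same two checks on the real column before overwriting it would show whether any quarters fall outside the level vector.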
Now I want to look at a summary of the data, to get a sense of each variable’s range and distribution:
summary(data)
## ListingKey ListingNumber ListingCreationDate
## 17A93590655669644DB4C06: 6 Min. : 4 Min. :2005-11-09
## 349D3587495831350F0F648: 4 1st Qu.: 400919 1st Qu.:2008-09-19
## 47C1359638497431975670B: 4 Median : 600554 Median :2012-06-16
## 8474358854651984137201C: 4 Mean : 627886 Mean :2011-07-08
## DE8535960513435199406CE: 4 3rd Qu.: 892634 3rd Qu.:2013-09-09
## 04C13599434217079754AEE: 3 Max. :1255725 Max. :2014-03-10
## (Other) :113912
## CreditGrade Term LoanStatus
## C : 5649 Min. :12.00 Current :56576
## D : 5153 1st Qu.:36.00 Completed :38074
## B : 4389 Median :36.00 Chargedoff :11992
## AA : 3509 Mean :40.83 Defaulted : 5018
## HR : 3508 3rd Qu.:36.00 Past Due (1-15 days) : 806
## (Other): 6745 Max. :60.00 Past Due (31-60 days): 363
## NA's :84984 (Other) : 1108
## ClosedDate BorrowerAPR BorrowerRate LenderYield
## Min. :2005-11-25 Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:2009-07-14 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :2011-04-05 Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :2011-03-07 Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:2013-01-30 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :2014-03-10 Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :58848 NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating.num ProsperRating.alpha ProsperScore ListingCategory.num
## Min. :1.000 C :18345 Min. : 1.00 Min. : 0.000
## 1st Qu.:3.000 B :15581 1st Qu.: 4.00 1st Qu.: 1.000
## Median :4.000 A :14551 Median : 6.00 Median : 1.000
## Mean :4.072 D :14274 Mean : 5.95 Mean : 2.774
## 3rd Qu.:5.000 E : 9795 3rd Qu.: 8.00 3rd Qu.: 3.000
## Max. :7.000 (Other):12307 Max. :11.00 Max. :20.000
## NA's :29084 NA's :29084 NA's :29084
## BorrowerState Occupation EmploymentStatus
## CA :14717 Other :28617 Employed :67322
## TX : 6842 Professional :13628 Full-time :26355
## NY : 6729 Computer Programmer: 4478 Self-employed: 6134
## FL : 6720 Executive : 4311 Not available: 5347
## IL : 5921 Teacher : 3759 Other : 3806
## (Other):67493 (Other) :55556 (Other) : 2718
## NA's : 5515 NA's : 3588 NA's : 2255
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 26.00 FALSE:56459 FALSE:101218
## Median : 67.00 TRUE :57478 TRUE :12719
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## 783C3371218786870A73D20: 1140 Min. :2005-11-09
## 3D4D3366260257624AB272D: 916 1st Qu.:2008-09-16
## 6A3B336601725506917317E: 698 Median :2012-06-17
## FEF83377364176536637E50: 611 Mean :2011-07-09
## C9643379247860156A00EC0: 342 3rd Qu.:2013-09-11
## (Other) : 9634 Max. :2014-03-10
## NA's :100596
## CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## Min. : 0.0 Min. : 19.0 Min. :1947-08-24
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.:1990-06-01
## Median :680.0 Median :699.0 Median :1995-11-01
## Mean :685.6 Mean :704.6 Mean :1994-11-17
## 3rd Qu.:720.0 3rd Qu.:739.0 3rd Qu.:2000-03-14
## Max. :880.0 Max. :899.0 Max. :2012-12-22
## NA's :591 NA's :591 NA's :697
## CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## Min. : 0.00 Min. : 0.00 Min. : 2.00
## 1st Qu.: 7.00 1st Qu.: 6.00 1st Qu.: 17.00
## Median :10.00 Median : 9.00 Median : 25.00
## Mean :10.32 Mean : 9.26 Mean : 26.75
## 3rd Qu.:13.00 3rd Qu.:12.00 3rd Qu.: 35.00
## Max. :59.00 Max. :54.00 Max. :136.00
## NA's :7604 NA's :7604 NA's :697
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 4.00 1st Qu.: 114.0 1st Qu.: 0.000
## Median : 6.00 Median : 271.0 Median : 1.000
## Mean : 6.97 Mean : 398.3 Mean : 1.435
## 3rd Qu.: 9.00 3rd Qu.: 525.0 3rd Qu.: 2.000
## Max. :51.00 Max. :14985.0 Max. :105.000
## NA's :697
## TotalInquiries CurrentDelinquencies AmountDelinquent
## Min. : 0.000 Min. : 0.0000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.0
## Median : 4.000 Median : 0.0000 Median : 0.0
## Mean : 5.584 Mean : 0.5921 Mean : 984.5
## 3rd Qu.: 7.000 3rd Qu.: 0.0000 3rd Qu.: 0.0
## Max. :379.000 Max. :83.0000 Max. :463881.0
## NA's :1159 NA's :697 NA's :7622
## DelinquenciesLast7Years PublicRecordsLast10Years
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000
## Mean : 4.155 Mean : 0.3126
## 3rd Qu.: 3.000 3rd Qu.: 0.0000
## Max. :99.000 Max. :38.0000
## NA's :990 NA's :697
## PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## Min. : 0.000 Min. : 0 Min. :0.000
## 1st Qu.: 0.000 1st Qu.: 3121 1st Qu.:0.310
## Median : 0.000 Median : 8549 Median :0.600
## Mean : 0.015 Mean : 17599 Mean :0.561
## 3rd Qu.: 0.000 3rd Qu.: 19521 3rd Qu.:0.840
## Max. :20.000 Max. :1435667 Max. :5.950
## NA's :7604 NA's :7604 NA's :7604
## AvailableBankcardCredit TotalTrades TradesNeverDelinquent.per
## Min. : 0 Min. : 0.00 Min. :0.000
## 1st Qu.: 880 1st Qu.: 15.00 1st Qu.:0.820
## Median : 4100 Median : 22.00 Median :0.940
## Mean : 11210 Mean : 23.23 Mean :0.886
## 3rd Qu.: 13180 3rd Qu.: 30.00 3rd Qu.:1.000
## Max. :646285 Max. :126.00 Max. :1.000
## NA's :7544 NA's :7544 NA's :7544
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## Min. : 0.000 Min. : 0.000 $25,000-49,999:32192
## 1st Qu.: 0.000 1st Qu.: 0.140 $50,000-74,999:31050
## Median : 0.000 Median : 0.220 $100,000+ :17337
## Mean : 0.802 Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 1.000 3rd Qu.: 0.320 Not displayed : 7741
## Max. :20.000 Max. :10.010 $1-24,999 : 7274
## NA's :7544 NA's :8554 (Other) : 1427
## IncomeVerifiable StatedMonthlyIncome LoanKey
## Mode :logical Min. : 0 CB1B37030986463208432A1: 6
## FALSE:8669 1st Qu.: 3200 2DEE3698211017519D7333F: 4
## TRUE :105268 Median : 4667 9F4B37043517554537C364C: 4
## Mean : 5608 D895370150591392337ED6D: 4
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4
## Max. :1750003 0D8F37036734373301ED419: 3
## (Other) :113912
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. :0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.00 1st Qu.: 9.00 1st Qu.: 9.00
## Median :1.00 Median : 16.00 Median : 15.00
## Mean :1.42 Mean : 22.93 Mean : 22.27
## 3rd Qu.:2.00 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :8.00 Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 Min. :2005-11-15 Q4 2013:14450
## 1st Qu.: 4000 1st Qu.:2008-10-02 Q1 2014:12172
## Median : 6500 Median :2012-06-26 Q3 2013: 9180
## Mean : 8337 Mean :2011-07-21 Q2 2013: 7099
## 3rd Qu.:12000 3rd Qu.:2013-09-18 Q3 2012: 5632
## Max. :35000 Max. :2014-03-12 (Other):65382
## NA's : 22
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
I first see that many of the columns have a lot of missing data. It’s not immediately clear to me whether the data for those rows is truly missing (but theoretically could have been gathered), or whether the information in those columns was simply not applicable to those rows. I will sort this out as I move through the data, but I want to see whether some information is, for example, only entered once the loan has been closed or completed. First, though, I will identify the factors of interest.
Prosper Loans, through cursory research (https://en.wikipedia.org/wiki/Prosper_Marketplace), appears to be a peer-to-peer lending company. The primary concern of companies is profit, and in this case, as I see no obvious measure of profit to the company itself, I will focus on profit to the lender (the lenders, presumably, keep the company in business). Of course, borrowers likewise keep the company in business, and given the measures collected, it’s possible to at least take a look at how borrower demographics influence loan funding. Variable names are cross-referenced with a document linked from the Kaggle site: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0.
The factors of most interest to lenders, I assume, might be (for example) LoanStatus (whether a loan is in good standing, repaid, written off, etc.), LenderYield (yield minus servicing fee), EstimatedEffectiveYield (yield minus servicing fee and uncollected interest, plus late fees; likely more informative than LenderYield), EstimatedReturn (estimated effective yield minus estimated loss), EstimatedLoss (loss on charge-offs), LoanCurrentDaysDelinquent, LP_GrossPrincipalLoss, and LP_NetPrincipalLoss. These seem most indicative of how much lenders might profit, or lose, from any particular borrower. What the lender should care most about, overall, is the ability to predict whether (or to what degree) a given (current or future) loan will pay off. In some cases, it is unclear from the documentation whether these are predictions assigned by Prosper at the outset, or descriptions of what actually happened during the course of the loans. Exploring the data might shed some light on this.
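The documentation describes EstimatedReturn as the difference between the estimated effective yield and the estimated loss. As a small sanity check, the sketch below applies that definition to the three non-missing rows visible in the `str(data)` output above. The values are copied as printed (so they are rounded), and the tolerance of 5e-4 is my own assumption to allow for that rounding:

```r
# Values copied from the str(data) output (rows 2, 4 and 5), rounded as printed
first_rows <- data.frame(
  EstimatedEffectiveYield = c(0.0796, 0.0849, 0.1832),
  EstimatedLoss           = c(0.0249, 0.0249, 0.0925),
  EstimatedReturn         = c(0.0547, 0.0600, 0.0907)
)

# Return implied by the documented definition
implied_return <- first_rows$EstimatedEffectiveYield - first_rows$EstimatedLoss

# Compare with the recorded value, allowing for rounding in the printed output
matches <- abs(implied_return - first_rows$EstimatedReturn) < 5e-4

print(data.frame(first_rows, implied_return, matches))
```

On the full data, the same comparison could be run on the real columns to see whether the relationship holds for every row with non-missing estimates.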
On the other hand, the factors I intuitively expect might be predictive of profit are the following (for example): CreditGrade (credit grade assigned when the listing went live), ProsperRating (rating assigned when the loan went live), ProsperScore (risk score), EstimatedReturn (predicted difference between estimated effective yield and estimated loss), ListingCategory (what the loan is for), Occupation, EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner, CreditScoreRangeLower/CreditScoreRangeUpper, FirstRecordedCreditLine, CurrentCreditLines, OpenCreditLines, TotalCreditLinespast7years, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, InquiriesLast6Months, TotalInquiries, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, PublicRecordsLast12Months, RevolvingCreditBalance, BankcardUtilization, AvailableBankcardCredit, TotalTrades (number of trade lines ever opened), TradesNeverDelinquent, TradesOpenedLast6Months, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, StatedMonthlyIncome, TotalProsperLoans (prior Prosper loans), TotalProsperPaymentsBilled (presumably, number of payments billed at time of listing), OnTimeProsperPayments (number of on-time payments at time of listing), ProsperPaymentsLessThanOneMonthLate, ProsperPaymentsOneMonthPlusLate, ProsperPrincipalBorrowed (amount borrowed at time of listing), ProsperPrincipalOutstanding (amount outstanding at time of listing), Recommendations (number of recommendations at time of listing), InvestmentFromFriendsCount (number of friends investing), InvestmentFromFriendsAmount (amount invested by friends), and Investors (total number of investors). This is too many variables to examine individually, so I expect to narrow the list down to a few, particularly where multiple measures reflect more or less the same thing, or don’t show any distinct pattern of correlation with other variables.
With respect to loan funding, some of the same predictors likely also influence the terms and amount of funding a borrower receives, as reflected by BorrowerAPR, BorrowerRate, LoanOriginalAmount, MonthlyLoanPayment, Term (the length of the loan), and PercentFunded (although this is unlikely to be informative for recently created loans).
The borrowers and loans are primarily indexed through the variables MemberKey and LoanNumber. Additional variables for keeping track of loans include LoanOriginationDate and LoanOriginationQuarter. ClosedDate is useful for quickly identifying loans which have been closed, and for which firm conclusions can be drawn as to how much lenders profited.
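Before relying on LoanNumber as an index, it may be worth checking that it is unique: in the first ten rows printed by `str(data)` above, the value 121268 appears twice. The following is a minimal sketch of such a check, using only those ten printed values rather than the full column:

```r
# LoanNumber values for the first ten rows, copied from the str(data) output
loan_number <- c(19141, 134815, 6466, 77296, 102670,
                 123257, 88353, 90051, 121268, 121268)

# Number of rows whose LoanNumber already appeared in an earlier row
n_repeats <- sum(duplicated(loan_number))

# The LoanNumber values that occur more than once
repeated_values <- unique(loan_number[duplicated(loan_number)])

print(n_repeats)
print(repeated_values)
```

Applied to the whole column, `sum(duplicated(data$LoanNumber))` would give the number of repeated rows, which would need to be handled before treating each row as a distinct loan.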
Here I want to double-check why information might be missing (e.g., whether some variables are assigned only once a loan has been closed).
closed <- round(colMeans(is.na(filter(data, !is.na(ClosedDate))))*100,2)
not_closed <- round(colMeans(is.na(filter(data, is.na(ClosedDate))))*100,2)
data.frame(closed, not_closed)
## closed not_closed
## ListingKey 0.00 0.00
## ListingNumber 0.00 0.00
## ListingCreationDate 0.00 0.00
## CreditGrade 47.44 100.00
## Term 0.00 0.00
## LoanStatus 0.00 0.00
## ClosedDate 0.00 100.00
## BorrowerAPR 0.05 0.00
## BorrowerRate 0.00 0.00
## LenderYield 0.00 0.00
## EstimatedEffectiveYield 52.79 0.00
## EstimatedLoss 52.79 0.00
## EstimatedReturn 52.79 0.00
## ProsperRating.num 52.79 0.00
## ProsperRating.alpha 52.79 0.00
## ProsperScore 52.79 0.00
## ListingCategory.num 0.00 0.00
## BorrowerState 10.01 0.00
## Occupation 4.12 2.24
## EmploymentStatus 4.09 0.00
## EmploymentStatusDuration 13.82 0.02
## IsBorrowerHomeowner 0.00 0.00
## CurrentlyInGroup 0.00 0.00
## GroupKey 77.00 98.86
## DateCreditPulled 0.00 0.00
## CreditScoreRangeLower 1.07 0.00
## CreditScoreRangeUpper 1.07 0.00
## FirstRecordedCreditLine 1.27 0.00
## CurrentCreditLines 13.80 0.00
## OpenCreditLines 13.80 0.00
## TotalCreditLinespast7years 1.27 0.00
## OpenRevolvingAccounts 0.00 0.00
## OpenRevolvingMonthlyPayment 0.00 0.00
## InquiriesLast6Months 1.27 0.00
## TotalInquiries 2.10 0.00
## CurrentDelinquencies 1.27 0.00
## AmountDelinquent 13.84 0.00
## DelinquenciesLast7Years 1.80 0.00
## PublicRecordsLast10Years 1.27 0.00
## PublicRecordsLast12Months 13.80 0.00
## RevolvingCreditBalance 13.80 0.00
## BankcardUtilization 13.80 0.00
## AvailableBankcardCredit 13.69 0.00
## TotalTrades 13.69 0.00
## TradesNeverDelinquent.per 13.69 0.00
## TradesOpenedLast6Months 13.69 0.00
## DebtToIncomeRatio 7.68 7.35
## IncomeRange 0.00 0.00
## IncomeVerifiable 0.00 0.00
## StatedMonthlyIncome 0.00 0.00
## LoanKey 0.00 0.00
## TotalProsperLoans 80.87 80.38
## TotalProsperPaymentsBilled 80.87 80.38
## OnTimeProsperPayments 80.87 80.38
## ProsperPaymentsLessThanOneMonthLate 80.87 80.38
## ProsperPaymentsOneMonthPlusLate 80.87 80.38
## ProsperPrincipalBorrowed 80.87 80.38
## ProsperPrincipalOutstanding 80.87 80.38
## ScorexChangeAtTimeOfListing 81.05 85.58
## LoanCurrentDaysDelinquent 0.00 0.00
## LoanFirstDefaultedCycleNumber 69.24 99.99
## LoanMonthsSinceOrigination 0.00 0.00
## LoanNumber 0.00 0.00
## LoanOriginalAmount 0.00 0.00
## LoanOriginationDate 0.00 0.00
## LoanOriginationQuarter 0.04 0.00
## MemberKey 0.00 0.00
## MonthlyLoanPayment 0.00 0.00
## LP_CustomerPayments 0.00 0.00
## LP_CustomerPrincipalPayments 0.00 0.00
## LP_InterestandFees 0.00 0.00
## LP_ServiceFees 0.00 0.00
## LP_CollectionFees 0.00 0.00
## LP_GrossPrincipalLoss 0.00 0.00
## LP_NetPrincipalLoss 0.00 0.00
## LP_NonPrincipalRecoverypayments 0.00 0.00
## PercentFunded 0.00 0.00
## Recommendations 0.00 0.00
## InvestmentFromFriendsCount 0.00 0.00
## InvestmentFromFriendsAmount 0.00 0.00
## Investors 0.00 0.00
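A table like the one above can be computed in a few lines. The following is a minimal base R sketch on a made-up four-row data frame, not the code that produced the output above. The helper `pct_missing_by_status` and the toy values are hypothetical; only the column names `ClosedDate`, `CreditGrade`, and `GroupKey` come from the data set.

```r
# Percentage of missing values per column, split by whether the loan is
# closed (ClosedDate present) or open (ClosedDate missing).
# NOTE: hypothetical helper, written for illustration only.
pct_missing_by_status <- function(df, closed_col = "ClosedDate") {
  is_closed <- !is.na(df[[closed_col]])
  pct <- function(rows) {
    round(100 * colMeans(is.na(df[rows, , drop = FALSE])), 2)
  }
  cbind(closed = pct(is_closed), open = pct(!is_closed))
}

# Toy data (made-up values): two closed loans, two open loans.
toy <- data.frame(
  ClosedDate  = as.Date(c("2009-08-14", NA, "2012-01-05", NA)),
  CreditGrade = c("C", NA, NA, NA),
  GroupKey    = c(NA, NA, "G1", NA),
  stringsAsFactors = FALSE
)

missing_tbl <- pct_missing_by_status(toy)
print(missing_tbl)
```

The first column then describes closed loans and the second open loans, which is how the percentages above are read in the rest of this section.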
The first thing I notice is that whether a loan is closed is a strong, though in most cases not perfect, predictor of whether values are missing.
None of the open loans have a credit grade, while about half of the closed loans do. I assume that the closed loans without one are post-July 2009 loans, which were never assigned a credit grade.
summary(filter(data, !is.na(ClosedDate) & is.na(CreditGrade)))
## ListingKey ListingNumber ListingCreationDate
## 018A360063948152589C8BE: 2 Min. : 149172 Min. :2007-06-08
## 30F435938764424435A1188: 2 1st Qu.: 479472 1st Qu.:2010-10-12
## 32943590099161153292459: 2 Median : 529900 Median :2011-09-28
## 6DFC3591891372387BB41B2: 2 Mean : 554859 Mean :2011-08-17
## 778D35919242972923313E0: 2 3rd Qu.: 600118 3rd Qu.:2012-06-14
## 82FD35914405776692938D4: 2 Max. :1204824 Max. :2014-02-13
## (Other) :26124
## CreditGrade Term LoanStatus
## NC : 0 Min. :12.00 Completed :19786
## HR : 0 1st Qu.:36.00 Chargedoff : 5342
## E : 0 Median :36.00 Defaulted : 1008
## D : 0 Mean :37.99 Cancelled : 0
## C : 0 3rd Qu.:36.00 Current : 0
## (Other): 0 Max. :60.00 FinalPaymentInProgress: 0
## NA's :26136 (Other) : 0
## ClosedDate BorrowerAPR BorrowerRate LenderYield
## Min. :2009-08-27 Min. :0.04583 Min. :0.0400 Min. :0.0300
## 1st Qu.:2012-06-12 1st Qu.:0.17359 1st Qu.:0.1469 1st Qu.:0.1369
## Median :2013-02-20 Median :0.26798 Median :0.2300 Median :0.2200
## Mean :2012-12-20 Mean :0.25118 Mean :0.2193 Mean :0.2093
## 3rd Qu.:2013-09-10 3rd Qu.:0.33553 3rd Qu.:0.2958 3rd Qu.:0.2858
## Max. :2014-03-10 Max. :0.42395 Max. :0.3600 Max. :0.3400
##
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.1827 Min. :0.00490 Min. :-0.1827
## 1st Qu.: 0.1106 1st Qu.:0.05200 1st Qu.: 0.0780
## Median : 0.1715 Median :0.09800 Median : 0.1144
## Mean : 0.1762 Mean :0.09379 Mean : 0.1075
## 3rd Qu.: 0.2469 3rd Qu.:0.14050 3rd Qu.: 0.1363
## Max. : 0.3199 Max. :0.36600 Max. : 0.2837
## NA's :131 NA's :131 NA's :131
## ProsperRating.num ProsperRating.alpha ProsperScore
## Min. :1.000 D :5869 Min. : 1.000
## 1st Qu.:2.000 E :3830 1st Qu.: 5.000
## Median :3.000 C :3817 Median : 6.000
## Mean :3.663 HR :3725 Mean : 6.266
## 3rd Qu.:5.000 A :3608 3rd Qu.: 8.000
## Max. :7.000 (Other):5156 Max. :11.000
## NA's :131 NA's : 131 NA's :131
## ListingCategory.num BorrowerState Occupation
## Min. : 0.00 CA : 3325 Other : 6786
## 1st Qu.: 1.00 FL : 1768 Professional : 3452
## Median : 2.00 NY : 1639 Computer Programmer : 1261
## Mean : 3.75 TX : 1562 Administrative Assistant: 959
## 3rd Qu.: 7.00 IL : 1389 Executive : 950
## Max. :20.00 GA : 1127 (Other) :12715
## (Other):15326 NA's : 13
## EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
## Employed :16491 Min. : 0.00 Mode :logical
## Full-time : 6634 1st Qu.: 27.00 FALSE:12814
## Self-employed: 1334 Median : 63.00 TRUE :13322
## Other : 798 Mean : 91.06
## Not employed : 375 3rd Qu.:127.00
## Retired : 273 Max. :755.00
## (Other) : 231 NA's :9
## CurrentlyInGroup GroupKey DateCreditPulled
## Mode :logical 3D4D3366260257624AB272D: 201 Min. :2009-07-13
## FALSE:24741 783C3371218786870A73D20: 134 1st Qu.:2010-10-13
## TRUE :1395 52EA3425051368132B80C96: 109 Median :2011-09-29
## B0473364376920128370B13: 63 Mean :2011-08-21
## FEF83377364176536637E50: 54 3rd Qu.:2012-06-14
## (Other) : 817 Max. :2014-02-13
## NA's :24758
## CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## Min. :600.0 Min. :619.0 Min. :1953-09-01
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.:1990-12-03
## Median :700.0 Median :719.0 Median :1996-04-16
## Mean :701.7 Mean :720.7 Mean :1995-04-06
## 3rd Qu.:740.0 3rd Qu.:759.0 3rd Qu.:2000-05-19
## Max. :880.0 Max. :899.0 Max. :2012-06-19
##
## CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## Min. : 0.000 Min. : 0.000 Min. : 2.0
## 1st Qu.: 6.000 1st Qu.: 5.000 1st Qu.: 16.0
## Median : 9.000 Median : 8.000 Median : 25.0
## Mean : 9.576 Mean : 8.454 Mean : 26.6
## 3rd Qu.:13.000 3rd Qu.:11.000 3rd Qu.: 35.0
## Max. :59.000 Max. :48.000 Max. :124.0
##
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 97.0 1st Qu.: 0.000
## Median : 6.000 Median : 231.0 Median : 1.000
## Mean : 6.442 Mean : 349.2 Mean : 1.188
## 3rd Qu.: 9.000 3rd Qu.: 457.0 3rd Qu.: 2.000
## Max. :47.000 Max. :5720.0 Max. :27.000
##
## TotalInquiries CurrentDelinquencies AmountDelinquent
## Min. : 0.000 Min. : 0.0000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.0
## Median : 4.000 Median : 0.0000 Median : 0.0
## Mean : 4.646 Mean : 0.3694 Mean : 992.6
## 3rd Qu.: 6.000 3rd Qu.: 0.0000 3rd Qu.: 0.0
## Max. :74.000 Max. :32.0000 Max. :327677.0
##
## DelinquenciesLast7Years PublicRecordsLast10Years
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000
## Mean : 3.401 Mean : 0.2609
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :99.000 Max. :12.0000
##
## PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## Min. :0.00000 Min. : 0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 2071 1st Qu.:0.2200
## Median :0.00000 Median : 6798 Median :0.5400
## Mean :0.01144 Mean : 15210 Mean :0.5141
## 3rd Qu.:0.00000 3rd Qu.: 16600 3rd Qu.:0.8100
## Max. :4.00000 Max. :879785 Max. :2.5000
##
## AvailableBankcardCredit TotalTrades TradesNeverDelinquent.per
## Min. : 0.0 Min. : 1.00 Min. :0.1600
## 1st Qu.: 850.8 1st Qu.: 14.00 1st Qu.:0.8300
## Median : 4198.0 Median : 21.00 Median :0.9500
## Mean : 11174.3 Mean : 22.87 Mean :0.8973
## 3rd Qu.: 13414.0 3rd Qu.: 30.00 3rd Qu.:1.0000
## Max. :412785.0 Max. :122.00 Max. :1.0000
##
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## Min. : 0.0000 Min. : 0.0000 $25,000-49,999:8367
## 1st Qu.: 0.0000 1st Qu.: 0.1300 $50,000-74,999:7411
## Median : 0.0000 Median : 0.2000 $75,000-99,999:4041
## Mean : 0.7603 Mean : 0.2488 $100,000+ :3948
## 3rd Qu.: 1.0000 3rd Qu.: 0.3000 $1-24,999 :1964
## Max. :20.0000 Max. :10.0100 Not employed : 375
## NA's :2983 (Other) : 30
## IncomeVerifiable StatedMonthlyIncome LoanKey
## Mode :logical Min. : 0 08C43696561586194AC381C: 2
## FALSE:2976 1st Qu.: 3167 09303699897852595CD59DD: 2
## TRUE :23160 Median : 4583 114D37056655628721BD6C8: 2
## Mean : 5488 156836977849742636AE34F: 2
## 3rd Qu.: 6667 56D73700259224545E36FBC: 2
## Max. :618548 63113695530739927C7EA06: 2
## (Other) :26124
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. :0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 9.00 1st Qu.: 9.00
## Median :1.000 Median : 18.00 Median : 18.00
## Mean :1.401 Mean : 22.57 Mean : 21.88
## 3rd Qu.:2.000 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :7.000 Max. :120.00 Max. :114.00
## NA's :17826 NA's :17826 NA's :17826
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000
## Mean : 0.635 Mean : 0.058
## 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :42.000 Max. :21.000
## NA's :17826 NA's :17826
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0.0
## 1st Qu.: 3000 1st Qu.: 0.0
## Median : 5000 Median : 824.7
## Mean : 7394 Mean : 2127.9
## 3rd Qu.:10000 3rd Qu.: 3179.1
## Max. :60001 Max. :22586.7
## NA's :17826 NA's :17826
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-194.00 Min. : 0.0
## 1st Qu.: -32.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -0.29 Mean : 115.9
## 3rd Qu.: 29.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :1593.0
## NA's :17923
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 1.00 Min. : 1.00 Min. : 38045
## 1st Qu.: 9.00 1st Qu.:21.00 1st Qu.: 45089
## Median :13.00 Median :29.00 Median : 54430
## Mean :14.49 Mean :30.47 Mean : 58559
## 3rd Qu.:19.00 3rd Qu.:41.00 3rd Qu.: 68482
## Max. :41.00 Max. :56.00 Max. :132453
## NA's :19891
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 Min. :2009-07-20 Q4 2011: 2352
## 1st Qu.: 3000 1st Qu.:2010-10-29 Q2 2012: 2272
## Median : 4500 Median :2011-10-12 Q1 2012: 2252
## Mean : 6365 Mean :2011-09-03 Q3 2012: 2213
## 3rd Qu.: 8000 3rd Qu.:2012-06-25 Q3 2011: 2018
## Max. :35000 Max. :2014-02-21 Q2 2011: 1713
## (Other):13316
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## C70934206057523078260C7: 7 Min. : 0.0 Min. : -2.35
## E4AF3422677498955FFA00E: 7 1st Qu.: 121.6 1st Qu.: 2304.53
## 720D3508651090808DC328F: 6 Median : 175.9 Median : 4561.31
## D65B3496915385104F50CD7: 6 Mean : 232.2 Mean : 6193.82
## E48334334509567416C8C65: 6 3rd Qu.: 314.4 3rd Qu.: 8501.98
## 43DB3366978035224D7D9E3: 5 Max. :2251.5 Max. :37369.16
## (Other) :26099
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0 Min. : -2.35 Min. :-589.95
## 1st Qu.: 1795 1st Qu.: 326.71 1st Qu.: -70.74
## Median : 4000 Median : 746.15 Median : -35.07
## Mean : 5128 Mean : 1065.72 Mean : -52.18
## 3rd Qu.: 7000 3rd Qu.: 1487.20 3rd Qu.: -16.07
## Max. :35000 Max. :10013.57 Max. : 3.01
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-4865.08 Min. : -94.2 Min. : -504.4
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -17.25 Mean : 1221.7 Mean : 1194.6
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.700 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.: 0.00000
## Median : 0.00 Median :1.000 Median : 0.00000
## Mean : 24.83 Mean :0.997 Mean : 0.03646
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.: 0.00000
## Max. :7780.03 Max. :1.000 Max. :18.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. :0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.: 0.00 1st Qu.: 28.00
## Median :0.00000 Median : 0.00 Median : 62.00
## Mean :0.02124 Mean : 12.94 Mean : 92.67
## 3rd Qu.:0.00000 3rd Qu.: 0.00 3rd Qu.: 125.00
## Max. :9.00000 Max. :11000.00 Max. :1189.00
##
Here, I see that at least one loan listed before July 2009 has no credit grade.
summary(filter(data, !is.na(ClosedDate) & is.na(CreditGrade) & ListingCreationDate < "2009-07-01"))
## ListingKey ListingNumber ListingCreationDate
## 0385345033494662260733C: 1 Min. :149172 Min. :2007-06-08
## 04D73431953660481B1EC1D: 1 1st Qu.:306608 1st Qu.:2008-04-08
## 04F334232790941784498F1: 1 Median :339464 Median :2008-05-26
## 05153419481232978723A5F: 1 Mean :341138 Mean :2008-06-24
## 059934165217732065237C5: 1 3rd Qu.:397924 3rd Qu.:2008-09-13
## 06FF342963152332574DF05: 1 Max. :415961 Max. :2009-05-06
## (Other) :125
## CreditGrade Term LoanStatus
## NC : 0 Min. :12.00 Completed :122
## HR : 0 1st Qu.:36.00 Chargedoff : 6
## E : 0 Median :36.00 Defaulted : 3
## D : 0 Mean :35.82 Cancelled : 0
## C : 0 3rd Qu.:36.00 Current : 0
## (Other): 0 Max. :36.00 FinalPaymentInProgress: 0
## NA's :131 (Other) : 0
## ClosedDate BorrowerAPR BorrowerRate
## Min. :2010-01-28 Min. :0.06207 Min. :0.05870
## 1st Qu.:2011-04-21 1st Qu.:0.11271 1st Qu.:0.09025
## Median :2012-04-05 Median :0.17018 Median :0.14000
## Mean :2012-02-01 Mean :0.18688 Mean :0.16300
## 3rd Qu.:2012-10-29 3rd Qu.:0.25811 3rd Qu.:0.22700
## Max. :2013-10-12 Max. :0.39460 Max. :0.35300
##
## LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :0.04870 Min. : NA Min. : NA Min. : NA
## 1st Qu.:0.08025 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median :0.13000 Median : NA Median : NA Median : NA
## Mean :0.15293 Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.:0.21700 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. :0.34000 Max. : NA Max. : NA Max. : NA
## NA's :131 NA's :131 NA's :131
## ProsperRating.num ProsperRating.alpha ProsperScore ListingCategory.num
## Min. : NA NC : 0 Min. : NA Min. :1.000
## 1st Qu.: NA HR : 0 1st Qu.: NA 1st Qu.:1.000
## Median : NA E : 0 Median : NA Median :1.000
## Mean :NaN D : 0 Mean :NaN Mean :2.893
## 3rd Qu.: NA C : 0 3rd Qu.: NA 3rd Qu.:5.000
## Max. : NA (Other): 0 Max. : NA Max. :7.000
## NA's :131 NA's :131 NA's :131
## BorrowerState Occupation EmploymentStatus
## CA :18 Other :30 Full-time :104
## TX :18 Professional :23 Employed : 12
## NY : 9 Analyst : 9 Part-time : 7
## IL : 7 Computer Programmer : 9 Retired : 4
## CT : 6 Administrative Assistant: 5 Self-employed: 4
## MN : 6 Teacher : 5 Not available: 0
## (Other):67 (Other) :50 (Other) : 0
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 26.00 FALSE:66 FALSE:107
## Median : 50.00 TRUE :65 TRUE :24
## Mean : 74.24
## 3rd Qu.:105.00
## Max. :472.00
##
## GroupKey DateCreditPulled CreditScoreRangeLower
## 783C3371218786870A73D20: 5 Min. :2009-07-13 Min. :600.0
## 020E3366126106360DB9421: 1 1st Qu.:2009-10-19 1st Qu.:660.0
## 17693364417023401A53169: 1 Median :2010-02-03 Median :720.0
## 18DA336463918236939DCE7: 1 Mean :2010-02-23 Mean :711.1
## 3D4D3366260257624AB272D: 1 3rd Qu.:2010-07-02 3rd Qu.:740.0
## (Other) : 15 Max. :2010-12-19 Max. :860.0
## NA's :107
## CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
## Min. :619.0 Min. :1959-10-01 Min. : 1.00
## 1st Qu.:679.0 1st Qu.:1992-12-11 1st Qu.: 7.00
## Median :739.0 Median :1996-08-28 Median : 9.00
## Mean :730.1 Mean :1995-06-17 Mean :10.27
## 3rd Qu.:759.0 3rd Qu.:2000-04-07 3rd Qu.:13.00
## Max. :879.0 Max. :2007-09-10 Max. :35.00
##
## OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 1.000 Min. : 4.00 Min. : 0.000
## 1st Qu.: 5.000 1st Qu.:17.00 1st Qu.: 4.000
## Median : 8.000 Median :22.00 Median : 6.000
## Mean : 8.832 Mean :25.51 Mean : 6.855
## 3rd Qu.:12.000 3rd Qu.:33.00 3rd Qu.: 9.000
## Max. :29.000 Max. :58.00 Max. :29.000
##
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. :0.000 Min. : 0.000
## 1st Qu.: 90.5 1st Qu.:0.000 1st Qu.: 2.000
## Median : 239.0 Median :0.000 Median : 4.000
## Mean : 309.1 Mean :0.855 Mean : 5.191
## 3rd Qu.: 420.0 3rd Qu.:1.000 3rd Qu.: 8.000
## Max. :1956.0 Max. :9.000 Max. :19.000
##
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. :0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median :0.0000 Median : 0.0 Median : 0.000
## Mean :0.2824 Mean : 433.7 Mean : 2.718
## 3rd Qu.:0.0000 3rd Qu.: 0.0 3rd Qu.: 0.000
## Max. :8.0000 Max. :31919.0 Max. :43.000
##
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. :0.0000 Min. :0 Min. : 0
## 1st Qu.:0.0000 1st Qu.:0 1st Qu.: 2308
## Median :0.0000 Median :0 Median : 8074
## Mean :0.1756 Mean :0 Mean :12039
## 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:16422
## Max. :3.0000 Max. :0 Max. :97290
##
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.0000 Min. : 0 Min. : 3.00
## 1st Qu.:0.1800 1st Qu.: 1557 1st Qu.:14.50
## Median :0.4400 Median : 6999 Median :19.00
## Mean :0.4524 Mean : 13522 Mean :22.21
## 3rd Qu.:0.7200 3rd Qu.: 17470 3rd Qu.:29.00
## Max. :0.9900 Max. :110117 Max. :52.00
##
## TradesNeverDelinquent.per TradesOpenedLast6Months DebtToIncomeRatio
## Min. :0.3000 Min. :0.0000 Min. :0.0200
## 1st Qu.:0.8400 1st Qu.:0.0000 1st Qu.:0.1100
## Median :0.9600 Median :0.0000 Median :0.2000
## Mean :0.8996 Mean :0.5725 Mean :0.2500
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.2725
## Max. :1.0000 Max. :5.0000 Max. :5.5900
## NA's :11
## IncomeRange IncomeVerifiable StatedMonthlyIncome
## $50,000-74,999:45 Mode :logical Min. : 212.8
## $25,000-49,999:40 FALSE:11 1st Qu.: 3333.3
## $75,000-99,999:17 TRUE :120 Median : 4616.7
## $100,000+ :16 Mean : 5111.2
## $1-24,999 :13 3rd Qu.: 6375.0
## Not displayed : 0 Max. :20833.3
## (Other) : 0
## LoanKey TotalProsperLoans
## 003C35735230494626ADB02: 1 Min. :1.000
## 02CA35638190585257E0D22: 1 1st Qu.:1.000
## 030B35936026115966F4EA0: 1 Median :1.000
## 032A357638786716375DFFB: 1 Mean :1.153
## 040235782802629332A0C8C: 1 3rd Qu.:1.000
## 05BC35722810324548A02FE: 1 Max. :3.000
## (Other) :125 NA's :72
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 1.00 Min. : 0.00
## 1st Qu.:14.50 1st Qu.:14.50
## Median :24.00 Median :22.00
## Mean :22.76 Mean :22.54
## 3rd Qu.:34.00 3rd Qu.:33.50
## Max. :42.00 Max. :41.00
## NA's :72 NA's :72
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median :0
## Mean :0.2203 Mean :0
## 3rd Qu.:0.0000 3rd Qu.:0
## Max. :3.0000 Max. :0
## NA's :72 NA's :72
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 1000 Min. : 0.00
## 1st Qu.: 1775 1st Qu.: 0.00
## Median : 4500 Median : 0.00
## Mean : 5491 Mean : 428.24
## 3rd Qu.: 7500 3rd Qu.: 0.25
## Max. :27000 Max. :5788.52
## NA's :72 NA's :72
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-50.00 Min. : 0.00
## 1st Qu.: -7.00 1st Qu.: 0.00
## Median : 39.00 Median : 0.00
## Mean : 43.37 Mean : 53.65
## 3rd Qu.: 83.00 3rd Qu.: 0.00
## Max. :215.00 Max. :1257.00
## NA's :74
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. :10.00 Min. :39.00 Min. :38046
## 1st Qu.:18.00 1st Qu.:44.00 1st Qu.:39344
## Median :23.00 Median :49.00 Median :40869
## Mean :24.22 Mean :48.34 Mean :41386
## 3rd Qu.:32.00 3rd Qu.:52.00 3rd Qu.:43474
## Max. :37.00 Max. :56.00 Max. :46378
## NA's :122
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 Min. :2009-07-22 Q4 2009:32
## 1st Qu.: 2000 1st Qu.:2009-11-08 Q3 2009:26
## Median : 3000 Median :2010-02-17 Q2 2010:21
## Mean : 4187 Mean :2010-03-11 Q4 2010:21
## 3rd Qu.: 5000 3rd Qu.:2010-07-18 Q1 2010:17
## Max. :15000 Max. :2010-12-30 Q3 2010:14
## (Other): 0
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 010B33941340101099BFE47: 1 Min. : 0.00 Min. : 458.2
## 016533808792025682035EE: 1 1st Qu.: 63.24 1st Qu.: 2161.4
## 0CCD3420393708396FB7287: 1 Median :111.95 Median : 3865.5
## 0F1733815422230679CFC01: 1 Mean :146.00 Mean : 4865.0
## 0F5133834635103374519DF: 1 3rd Qu.:188.66 3rd Qu.: 6402.7
## 10D73380714543112C251DF: 1 Max. :578.69 Max. :18748.2
## (Other) :125
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 204.8 Min. : 11.26 Min. :-242.93
## 1st Qu.: 1946.1 1st Qu.: 254.88 1st Qu.: -62.53
## Median : 3000.0 Median : 546.00 Median : -38.67
## Mean : 4043.8 Mean : 821.17 Mean : -50.11
## 3rd Qu.: 5000.0 3rd Qu.:1143.52 3rd Qu.: -19.86
## Max. :15000.0 Max. :3748.19 Max. : -1.41
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :0 Min. : 0.0 Min. : 0.0
## 1st Qu.:0 1st Qu.: 0.0 1st Qu.: 0.0
## Median :0 Median : 0.0 Median : 0.0
## Mean :0 Mean : 145.4 Mean : 145.4
## 3rd Qu.:0 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. :0 Max. :8911.2 Max. :8911.2
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. :0 Min. :1 Min. :0.00000
## 1st Qu.:0 1st Qu.:1 1st Qu.:0.00000
## Median :0 Median :1 Median :0.00000
## Mean :0 Mean :1 Mean :0.08397
## 3rd Qu.:0 3rd Qu.:1 3rd Qu.:0.00000
## Max. :0 Max. :1 Max. :2.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. :0.00000 Min. : 0.00 Min. : 10.0
## 1st Qu.:0.00000 1st Qu.: 0.00 1st Qu.: 75.5
## Median :0.00000 Median : 0.00 Median :124.0
## Mean :0.03817 Mean : 57.97 Mean :155.5
## 3rd Qu.:0.00000 3rd Qu.: 0.00 3rd Qu.:204.0
## Max. :1.00000 Max. :5140.00 Max. :594.0
##
I see that 131 loans listed before July 2009 are missing a credit grade for no apparent reason. I don’t see any pattern here, and assume that I cannot tell right now why this data is missing. However, this is a relatively small amount of data.
I am otherwise assuming that CreditGrade was effectively replaced by ProsperRating in 2009, and that these can be used more-or-less interchangeably, particularly given that their labels correspond.
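Before treating the two ratings as interchangeable, it may be worth checking that their labels really do match and that no loan carries both. The following is a base R sketch on made-up vectors; the level sets and values are assumptions for illustration, and on the real data the two vectors would be `data$CreditGrade` and `data$ProsperRating.alpha`.

```r
# Toy stand-ins for the two rating columns (made-up values and level sets).
credit_grade <- factor(
  c("C", NA, "HR", NA, NA),
  levels = c("AA", "A", "B", "C", "D", "E", "HR", "NC")
)
prosper_rating <- factor(
  c(NA, "A", NA, "D", NA),
  levels = c("AA", "A", "B", "C", "D", "E", "HR")
)

# Labels that exist on one scale but not on the other.
only_in_grade  <- setdiff(levels(credit_grade), levels(prosper_rating))
only_in_rating <- setdiff(levels(prosper_rating), levels(credit_grade))

# Number of ratings present per loan: 0 = neither, 1 = exactly one, 2 = both.
n_ratings <- (!is.na(credit_grade)) + (!is.na(prosper_rating))
overlap   <- table(factor(n_ratings, levels = 0:2))

print(only_in_grade)
print(only_in_rating)
print(overlap)
```

If the count under `2` is zero, the two columns never overlap and can be merged without having to decide which one wins.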
Next, I notice that only about half of the closed loans have an estimated effective yield or the other estimates of yield/loss, whereas every open loan has them. I assume the closed loans missing these estimates are pre-July 2009 listings, but I want to take a closer look at them.
summary(filter(data, !is.na(ClosedDate) & is.na(EstimatedEffectiveYield)))
## ListingKey ListingNumber ListingCreationDate
## 00033425227988088FA6752: 1 Min. : 4 Min. :2005-11-09
## 000433785890431972B4743: 1 1st Qu.: 92588 1st Qu.:2007-02-02
## 00083422661625108817246: 1 Median :199844 Median :2007-09-10
## 000A34209897973969CFA81: 1 Mean :201960 Mean :2007-08-26
## 000D3410451511356B08F17: 1 3rd Qu.:314319 3rd Qu.:2008-04-19
## 00143395229257559A91663: 1 Max. :415961 Max. :2009-05-06
## (Other) :29078
## CreditGrade Term LoanStatus
## C :5649 Min. :12 Completed :18410
## D :5153 1st Qu.:36 Chargedoff : 6656
## B :4389 Median :36 Defaulted : 4013
## AA :3509 Mean :36 Cancelled : 5
## HR :3508 3rd Qu.:36 Current : 0
## (Other):6745 Max. :36 FinalPaymentInProgress: 0
## NA's : 131 (Other) : 0
## ClosedDate BorrowerAPR BorrowerRate LenderYield
## Min. :2005-11-25 Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:2008-08-25 1st Qu.:0.13705 1st Qu.:0.1269 1st Qu.: 0.1170
## Median :2009-08-17 Median :0.18224 Median :0.1700 Median : 0.1600
## Mean :2009-07-30 Mean :0.19596 Mean :0.1833 Mean : 0.1730
## 3rd Qu.:2010-07-29 3rd Qu.:0.24753 3rd Qu.:0.2364 3rd Qu.: 0.2224
## Max. :2013-10-12 Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating.num
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## NA's :29084 NA's :29084 NA's :29084 NA's :29084
## ProsperRating.alpha ProsperScore ListingCategory.num BorrowerState
## NC : 0 Min. : NA Min. :0.000 CA : 3956
## HR : 0 1st Qu.: NA 1st Qu.:0.000 GA : 1661
## E : 0 Median : NA Median :0.000 IL : 1657
## D : 0 Mean :NaN Mean :1.203 FL : 1314
## C : 0 3rd Qu.: NA 3rd Qu.:1.000 TX : 1208
## (Other): 0 Max. : NA Max. :7.000 (Other):13773
## NA's :29084 NA's :29084 NA's : 5515
## Occupation EmploymentStatus
## Other : 7300 Full-time :18428
## Professional : 3086 Not available: 5347
## Computer Programmer: 1242 Self-employed: 1596
## Sales - Commission : 1096 Part-time : 832
## Clerical : 1048 Retired : 428
## (Other) :13057 (Other) : 198
## NA's : 2255 NA's : 2255
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 15.00 FALSE:16454 FALSE:18611
## Median : 40.00 TRUE :12630 TRUE :10473
## Mean : 68.49
## 3rd Qu.: 94.00
## Max. :623.00
## NA's :7606
## GroupKey DateCreditPulled
## 783C3371218786870A73D20: 932 Min. :2005-11-09
## 6A3B336601725506917317E: 619 1st Qu.:2007-01-30
## 3D4D3366260257624AB272D: 606 Median :2007-09-04
## FEF83377364176536637E50: 529 Mean :2007-08-24
## C9643379247860156A00EC0: 342 3rd Qu.:2008-04-17
## (Other) : 8287 Max. :2010-12-19
## NA's :17769
## CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## Min. : 0.0 Min. : 19.0 Min. :1947-08-24
## 1st Qu.:600.0 1st Qu.:619.0 1st Qu.:1990-07-26
## Median :640.0 Median :659.0 Median :1995-06-01
## Mean :644.4 Mean :663.4 Mean :1994-08-07
## 3rd Qu.:700.0 3rd Qu.:719.0 3rd Qu.:1999-08-31
## Max. :880.0 Max. :899.0 Max. :2008-07-01
## NA's :591 NA's :591 NA's :697
## CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## Min. : 0.000 Min. : 0.0 Min. : 2.00
## 1st Qu.: 5.000 1st Qu.: 4.0 1st Qu.: 13.00
## Median : 9.000 Median : 7.0 Median : 22.00
## Mean : 9.563 Mean : 8.2 Mean : 24.06
## 3rd Qu.:13.000 3rd Qu.:11.0 3rd Qu.: 32.00
## Max. :52.000 Max. :51.0 Max. :136.00
## NA's :7604 NA's :7604 NA's :697
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 35.0 1st Qu.: 0.000
## Median : 5.000 Median : 139.0 Median : 2.000
## Mean : 5.755 Mean : 303.7 Mean : 2.841
## 3rd Qu.: 8.000 3rd Qu.: 374.0 3rd Qu.: 4.000
## Max. :51.000 Max. :14985.0 Max. :105.000
## NA's :697
## TotalInquiries CurrentDelinquencies AmountDelinquent
## Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 0
## Median : 7.000 Median : 0.000 Median : 0
## Mean : 9.516 Mean : 1.398 Mean : 1118
## 3rd Qu.: 13.000 3rd Qu.: 1.000 3rd Qu.: 30
## Max. :379.000 Max. :83.000 Max. :444745
## NA's :1159 NA's :697 NA's :7622
## DelinquenciesLast7Years PublicRecordsLast10Years
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000
## Mean : 5.652 Mean : 0.3949
## 3rd Qu.: 6.000 3rd Qu.: 1.0000
## Max. :99.000 Max. :30.0000
## NA's :990 NA's :697
## PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## Min. :0.000 Min. : 0 Min. :0.00
## 1st Qu.:0.000 1st Qu.: 1192 1st Qu.:0.20
## Median :0.000 Median : 5206 Median :0.60
## Mean :0.039 Mean : 16250 Mean :0.55
## 3rd Qu.:0.000 3rd Qu.: 15590 3rd Qu.:0.88
## Max. :7.000 Max. :1435667 Max. :5.95
## NA's :7604 NA's :7604 NA's :7604
## AvailableBankcardCredit TotalTrades TradesNeverDelinquent.per
## Min. : 0 Min. : 0.00 Min. :0.000
## 1st Qu.: 253 1st Qu.: 11.00 1st Qu.:0.690
## Median : 2277 Median : 18.00 Median :0.870
## Mean : 10460 Mean : 20.48 Mean :0.807
## 3rd Qu.: 10162 3rd Qu.: 28.00 3rd Qu.:1.000
## Max. :646285 Max. :126.00 Max. :1.000
## NA's :7544 NA's :7544 NA's :7544
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## Min. : 0.000 Min. : 0.0000 $25,000-49,999:8017
## 1st Qu.: 0.000 1st Qu.: 0.1200 Not displayed :7741
## Median : 1.000 Median : 0.2000 $50,000-74,999:5423
## Mean : 1.088 Mean : 0.3239 $1-24,999 :2620
## 3rd Qu.: 2.000 3rd Qu.: 0.3000 $75,000-99,999:2418
## Max. :17.000 Max. :10.0100 $100,000+ :2132
## NA's :7544 NA's :1258 (Other) : 733
## IncomeVerifiable StatedMonthlyIncome LoanKey
## Mode :logical Min. : 0 00013421083473792D70F75: 1
## FALSE:1336 1st Qu.: 2500 000534180797040005C07AA: 1
## TRUE :27748 Median : 3833 00093413855467649508680: 1
## Mean : 4665 000B3366346245964D6187E: 1
## 3rd Qu.: 5752 000B34179327090460D3429: 1
## Max. :208333 000E3392089465002A7DBA0: 1
## (Other) :29078
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. :1.000 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 7.00 1st Qu.: 6.00
## Median :1.000 Median :10.00 Median :10.00
## Mean :1.079 Mean :11.09 Mean :10.87
## 3rd Qu.:1.000 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :5.000 Max. :42.00 Max. :41.00
## NA's :26796 NA's :26796 NA's :26796
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.205 Mean :0.011
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :7.000 Max. :5.000
## NA's :26796 NA's :26796
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 1000 Min. : 0
## 1st Qu.: 2550 1st Qu.: 0
## Median : 4500 Median : 1970
## Mean : 6012 Mean : 3027
## 3rd Qu.: 7500 3rd Qu.: 4145
## Max. :40000 Max. :21862
## NA's :26796 NA's :26796
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-160.000 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.0
## Median : 0.000 Median : 0.0
## Mean : 7.363 Mean : 491.8
## 3rd Qu.: 40.000 3rd Qu.: 948.2
## Max. : 215.000 Max. :2704.0
## NA's :26798
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 39.00 Min. : 1
## 1st Qu.:10.00 1st Qu.: 70.00 1st Qu.: 7395
## Median :16.00 Median : 78.00 Median :19450
## Mean :17.32 Mean : 78.21 Mean :19418
## 3rd Qu.:24.00 3rd Qu.: 85.00 3rd Qu.:30463
## Max. :44.00 Max. :100.00 Max. :46378
## NA's :18376
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 Min. :2005-11-15 Q2 2008: 4344
## 1st Qu.: 2500 1st Qu.:2007-02-13 Q3 2008: 3602
## Median : 4500 Median :2007-09-21 Q2 2007: 3118
## Mean : 6159 Mean :2007-09-09 Q1 2007: 3079
## 3rd Qu.: 7904 3rd Qu.:2008-05-02 Q1 2008: 3074
## Max. :25000 Max. :2010-12-30 (Other):11845
## NA's : 22
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 3EF133647645155044BFFD9: 6 Min. : 0.00 Min. : 0
## 7E1733653050264822FAA3D: 6 1st Qu.: 84.84 1st Qu.: 1647
## 16083364744933457E57FB9: 4 Median : 153.80 Median : 3778
## 242A33660960718280E1642: 4 Mean : 215.72 Mean : 5683
## 5B8333756488098823F5EFE: 4 3rd Qu.: 275.77 3rd Qu.: 7403
## 63CA34120866140639431C9: 4 Max. :1130.90 Max. :40702
## (Other) :29056
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0 Min. : 0.0 Min. :-664.87
## 1st Qu.: 1069 1st Qu.: 335.4 1st Qu.: -76.15
## Median : 3000 Median : 779.3 Median : -33.50
## Mean : 4502 Mean : 1180.7 Mean : -54.97
## 3rd Qu.: 6000 3rd Qu.: 1532.2 3rd Qu.: -13.14
## Max. :25693 Max. :15617.0 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : 0 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0.0
## Median : 0.00 Median : 0 Median : 0.0
## Mean : -31.86 Mean : 1647 Mean : 1596.6
## 3rd Qu.: 0.00 3rd Qu.: 1863 3rd Qu.: 1748.7
## Max. : 0.00 Max. :25000 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :1.000 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.: 0.0000
## Median : 0.00 Median :1.000 Median : 0.0000
## Mean : 76.19 Mean :1.000 Mean : 0.1369
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.: 0.0000
## Max. :21117.90 Max. :1.011 Max. :39.0000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 34.0
## Median : 0.00000 Median : 0.00 Median : 78.0
## Mean : 0.06842 Mean : 52.25 Mean :116.1
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.:158.0
## Max. :33.00000 Max. :25000.00 Max. :913.0
##
That is indeed the case: every one of these listings was created before July 2009. As the same percentage of the other, similar measures is missing, I will assume the same holds for them.
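Equal percentages do not strictly guarantee that the same rows are missing, so the assumption can be checked directly. This is a minimal base R sketch on a made-up data frame; the values are hypothetical, and on the real data `toy` would be replaced by `data` with whichever estimate columns are of interest.

```r
# Toy data (made-up values) using three of the estimate columns.
toy <- data.frame(
  EstimatedEffectiveYield = c(NA, 0.17, NA, 0.11),
  EstimatedLoss           = c(NA, 0.05, NA, 0.10),
  EstimatedReturn         = c(NA, 0.11, NA, 0.08)
)

est_cols   <- c("EstimatedEffectiveYield", "EstimatedLoss", "EstimatedReturn")
na_pattern <- is.na(toy[est_cols])  # logical matrix, one column per measure

# TRUE if, in every row, the estimates are either all missing or all present.
same_rows_missing <- all(rowSums(na_pattern) %in% c(0, length(est_cols)))
print(same_rows_missing)
```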
I see that some borrower demographic, employment, and previous credit information is missing, but I assume that this is simply missing data, with no larger story behind it, particularly as this is a relatively small percentage of loans. I also see that more of this information is missing for loans that have been closed, which suggests to me that this data was either lost, or not gathered as thoroughly in the past.
The majority of the borrowers in both categories have no prior Prosper history, and it would be interesting to see if, for example, not having any Prosper history leads to more delinquencies than having positive Prosper history.
Most loans were not charged off, but about 30% of closed loans have a LoanFirstDefaultedCycleNumber, meaning that they defaulted at some point. Almost none of the open loans do (0.01%).
At this point, I want to take a look, through plotting correlations, at how predictive the above background, financial, or demographic measures are of measures most closely related to lender profit.
LoanStatus vs. CreditGrade/ProsperRating
In the case of LoanStatus, as this is not a quantitative or clearly ordered factor, it may make sense to at least visually organize some of the levels. I therefore ‘group’ all Past Due levels together, and order the levels loosely in terms of ‘goodness’ - assuming that being on time, or having paid off the loan, is ‘good,’ and that having defaulted, or having the loan charged off, is ‘bad.’ I group CreditGrade and ProsperRating into one measure, and then plot LoanStatus by this new rating, to see if there are any obvious patterns on how likely one is to have a particular loan status, given a particular starting rating.
What I see here is that the higher the rating, the greater the likelihood that the loan is either completed or current, and the lower the likelihood that it is past due, charged off, or defaulted. Overall, it seems that a customer with a higher rating at the time the loan is posted will indeed be more likely to pay off a loan in the future.
First, I want to get a sense of when these measures might be getting assigned, in cases where the documentation does not make this clear. To check this, I will look at loans which have not been closed, and see if they systematically include this information (compared to loans which are closed). If they do, it’s relatively safe to say that these measures are predictions, rather than reports of actual yield.
summary(filter(data, is.na(ClosedDate)))
## ListingKey ListingNumber ListingCreationDate
## 17A93590655669644DB4C06: 6 Min. : 464139 Min. :2010-06-24
## 349D3587495831350F0F648: 4 1st Qu.: 682358 1st Qu.:2012-12-04
## 47C1359638497431975670B: 4 Median : 875238 Median :2013-08-20
## 8474358854651984137201C: 4 Mean : 870182 Mean :2013-05-16
## DE8535960513435199406CE: 4 3rd Qu.:1051465 3rd Qu.:2013-12-05
## 04C13599434217079754AEE: 3 Max. :1255725 Max. :2014-03-10
## (Other) :58823
## CreditGrade Term LoanStatus
## NC : 0 Min. :12.00 Current :56576
## HR : 0 1st Qu.:36.00 Past Due (1-15 days) : 806
## E : 0 Median :36.00 Past Due (31-60 days) : 363
## D : 0 Mean :44.47 Past Due (61-90 days) : 313
## C : 0 3rd Qu.:60.00 Past Due (91-120 days): 304
## (Other): 0 Max. :60.00 Past Due (16-30 days) : 265
## NA's :58848 (Other) : 221
## ClosedDate BorrowerAPR BorrowerRate LenderYield
## Min. :NA Min. :0.06106 Min. :0.0577 Min. :0.0477
## 1st Qu.:NA 1st Qu.:0.16056 1st Qu.:0.1334 1st Qu.:0.1234
## Median :NA Median :0.20679 Median :0.1769 Median :0.1669
## Mean :NA Mean :0.21568 Mean :0.1856 Mean :0.1756
## 3rd Qu.:NA 3rd Qu.:0.26877 3rd Qu.:0.2346 3rd Qu.:0.2246
## Max. :NA Max. :0.38486 Max. :0.3435 Max. :0.3335
## NA's :58848
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :0.0474 Min. :0.00490 Min. :0.03700
## 1st Qu.:0.1181 1st Qu.:0.04200 1st Qu.:0.07400
## Median :0.1575 Median :0.06490 Median :0.08728
## Mean :0.1653 Mean :0.07435 Mean :0.09100
## 3rd Qu.:0.2086 3rd Qu.:0.10250 3rd Qu.:0.10790
## Max. :0.3057 Max. :0.20300 Max. :0.17610
##
## ProsperRating.num ProsperRating.alpha ProsperScore ListingCategory.num
## Min. :1.000 C :14528 Min. : 1.00 Min. : 0.000
## 1st Qu.:3.000 B :12208 1st Qu.: 4.00 1st Qu.: 1.000
## Median :4.000 A :10943 Median : 6.00 Median : 1.000
## Mean :4.253 D : 8405 Mean : 5.81 Mean : 3.118
## 3rd Qu.:5.000 E : 5965 3rd Qu.: 8.00 3rd Qu.: 2.000
## Max. :7.000 AA : 3589 Max. :11.00 Max. :20.000
## (Other): 3210
## BorrowerState Occupation EmploymentStatus
## CA : 7454 Other :14561 Employed :50831
## NY : 4214 Professional : 7113 Self-employed: 3208
## TX : 4090 Executive : 2522 Other : 3008
## FL : 3642 Teacher : 2111 Full-time : 1397
## IL : 2882 Computer Programmer: 1984 Not employed : 274
## OH : 2389 (Other) :29237 Retired : 98
## (Other):34177 NA's : 1320 (Other) : 32
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.0 Mode :logical Mode :logical
## 1st Qu.: 32.0 FALSE:27257 FALSE:57973
## Median : 79.0 TRUE :31591 TRUE :875
## Mean :108.3
## 3rd Qu.:156.0
## Max. :733.0
## NA's :10
## GroupKey DateCreditPulled
## 3D4D3366260257624AB272D: 110 Min. :2008-01-23
## 783C3371218786870A73D20: 79 1st Qu.:2012-12-03
## 52EA3425051368132B80C96: 41 Median :2013-08-22
## FEF83377364176536637E50: 29 Mean :2013-05-17
## 6A3B336601725506917317E: 26 3rd Qu.:2013-12-05
## (Other) : 387 Max. :2014-03-10
## NA's :58176
## CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## Min. :600.0 Min. :619.0 Min. :1951-01-01
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.:1990-03-01
## Median :700.0 Median :719.0 Median :1995-11-22
## Mean :698.4 Mean :717.4 Mean :1994-11-04
## 3rd Qu.:720.0 3rd Qu.:739.0 3rd Qu.:2000-05-11
## Max. :880.0 Max. :899.0 Max. :2012-12-22
##
## CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## Min. : 0.00 Min. : 0 Min. : 2.00
## 1st Qu.: 7.00 1st Qu.: 7 1st Qu.: 19.00
## Median :10.00 Median : 9 Median : 27.00
## Mean :10.92 Mean :10 Mean : 28.12
## 3rd Qu.:14.00 3rd Qu.:13 3rd Qu.: 36.00
## Max. :54.00 Max. :54 Max. :125.00
##
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## Min. : 0.000 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 5.000 1st Qu.: 188.0 1st Qu.: 0.0000
## Median : 7.000 Median : 344.0 Median : 0.0000
## Mean : 7.805 Mean : 466.6 Mean : 0.8649
## 3rd Qu.:10.000 3rd Qu.: 606.0 3rd Qu.: 1.0000
## Max. :50.000 Max. :13765.0 Max. :15.0000
##
## TotalInquiries CurrentDelinquencies AmountDelinquent
## Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0
## Median : 3.000 Median : 0.0000 Median : 0
## Mean : 4.134 Mean : 0.3015 Mean : 931
## 3rd Qu.: 6.000 3rd Qu.: 0.0000 3rd Qu.: 0
## Max. :78.000 Max. :51.0000 Max. :463881
##
## DelinquenciesLast7Years PublicRecordsLast10Years
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000
## Mean : 3.772 Mean : 0.2956
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :99.000 Max. :38.0000
##
## PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## Min. : 0.00000 Min. : 0 Min. :0.0000
## 1st Qu.: 0.00000 1st Qu.: 4736 1st Qu.:0.3700
## Median : 0.00000 Median : 10388 Median :0.6200
## Mean : 0.00814 Mean : 19140 Mean :0.5862
## 3rd Qu.: 0.00000 3rd Qu.: 21972 3rd Qu.:0.8300
## Max. :20.00000 Max. :999165 Max. :1.8200
##
## AvailableBankcardCredit TotalTrades TradesNeverDelinquent.per
## Min. : 0 Min. : 1.0 Min. :0.0800
## 1st Qu.: 1296 1st Qu.: 16.0 1st Qu.:0.8500
## Median : 4727 Median : 23.0 Median :0.9600
## Mean : 11506 Mean : 24.4 Mean :0.9097
## 3rd Qu.: 14111 3rd Qu.: 31.0 3rd Qu.:1.0000
## Max. :498374 Max. :108.0 Max. :1.0000
##
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## Min. : 0.0000 Min. : 0.000 $50,000-74,999:18261
## 1st Qu.: 0.0000 1st Qu.: 0.160 $25,000-49,999:15848
## Median : 0.0000 Median : 0.230 $100,000+ :11273
## Mean : 0.7159 Mean : 0.263 $75,000-99,999:10474
## 3rd Qu.: 1.0000 3rd Qu.: 0.320 $1-24,999 : 2703
## Max. :16.0000 Max. :10.010 Not employed : 274
## NA's :4324 (Other) : 15
## IncomeVerifiable StatedMonthlyIncome LoanKey
## Mode :logical Min. : 0 CB1B37030986463208432A1: 6
## FALSE:4368 1st Qu.: 3617 2DEE3698211017519D7333F: 4
## TRUE :54480 Median : 5167 9F4B37043517554537C364C: 4
## Mean : 6126 D895370150591392337ED6D: 4
## 3rd Qu.: 7417 E6FB37073953690388BC56D: 4
## Max. :1750003 0D8F37036734373301ED419: 3
## (Other) :58823
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. :1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.0 1st Qu.: 10.00 1st Qu.: 10.00
## Median :1.0 Median : 17.00 Median : 17.00
## Mean :1.5 Mean : 25.54 Mean : 24.81
## 3rd Qu.:2.0 3rd Qu.: 35.00 3rd Qu.: 35.00
## Max. :8.0 Max. :141.00 Max. :141.00
## NA's :47302 NA's :47302 NA's :47302
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.68 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :47302 NA's :47302
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 1000 Min. : 0.00
## 1st Qu.: 4000 1st Qu.: 0.01
## Median : 7400 Median : 2213.24
## Mean : 9721 Mean : 3475.83
## 3rd Qu.:13500 3rd Qu.: 5204.00
## Max. :72499 Max. :23450.95
## NA's :47302 NA's :47302
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.0 Min. : 0.000
## 1st Qu.: -38.0 1st Qu.: 0.000
## Median : -9.0 Median : 0.000
## Mean : -8.6 Mean : 1.468
## 3rd Qu.: 18.0 3rd Qu.: 0.000
## Max. : 220.0 Max. :129.000
## NA's :50362
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 1.00 Min. : 0.00 Min. : 43212
## 1st Qu.: 1.00 1st Qu.: 3.00 1st Qu.: 79386
## Median : 7.50 Median : 7.00 Median :100276
## Mean :11.88 Mean : 9.68 Mean : 98941
## 3rd Qu.:17.25 3rd Qu.:15.00 3rd Qu.:121614
## Max. :38.00 Max. :45.00 Max. :136486
## NA's :58840
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1500 Min. :2010-06-30 Q4 2013:14058
## 1st Qu.: 4000 1st Qu.:2012-12-18 Q1 2014:12103
## Median :10000 Median :2013-08-29 Q3 2013: 8592
## Mean :10280 Mean :2013-05-27 Q2 2013: 6268
## 3rd Qu.:15000 3rd Qu.:2013-12-16 Q3 2012: 3419
## Max. :35000 Max. :2014-03-12 Q4 2012: 3022
## (Other):11386
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## F80D3694083622957BA09F2: 6 Min. : 0.0 Min. : 0
## 0F0C35762146892131F3BB4: 4 1st Qu.: 166.6 1st Qu.: 555
## 22B53699795042922A27DCC: 4 Median : 286.9 Median : 1516
## 61E93477058090904D07D4F: 4 Mean : 318.1 Mean : 2550
## 946A35068649687154063A9: 4 3rd Qu.: 415.1 3rd Qu.: 3367
## EA463494084516244B9C542: 4 Max. :2163.6 Max. :31613
## (Other) :58822
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : 0.0 Min. :-564.85
## 1st Qu.: 286.8 1st Qu.: 221.7 1st Qu.: -73.29
## Median : 795.5 Median : 640.4 Median : -34.75
## Mean : 1519.1 Mean : 1031.3 Mean : -55.72
## 3rd Qu.: 1872.4 3rd Qu.: 1410.9 3rd Qu.: -13.11
## Max. :30831.1 Max. :10572.8 Max. : 0.77
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-1242.460 Min. :0 Min. :0
## 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 0.000 Median :0 Median :0
## Mean : -4.171 Mean :0 Mean :0
## 3rd Qu.: 0.000 3rd Qu.:0 3rd Qu.:0
## Max. : 0.000 Max. :0 Max. :0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. :0 Min. :0.7000 Min. : 0.000000
## 1st Qu.:0 1st Qu.:1.0000 1st Qu.: 0.000000
## Median :0 Median :1.0000 Median : 0.000000
## Mean :0 Mean :0.9986 Mean : 0.009312
## 3rd Qu.:0 3rd Qu.:1.0000 3rd Qu.: 0.000000
## Max. :0 Max. :1.0125 Max. :19.000000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. :0.00000 Min. : 0.0000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 1.00
## Median :0.00000 Median : 0.0000 Median : 8.00
## Mean :0.00226 Mean : 0.6037 Mean : 57.62
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 79.00
## Max. :6.00000 Max. :3000.0000 Max. :779.00
##
All of these open loans have non-zero values assigned to the following measures, suggesting that these measures are predictive rather than descriptive of actual outcomes: LenderYield, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn. On the other hand, many open loans have zero values assigned for these profit measures: LP_CustomerPayments, LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees, LP_CollectionFees, LP_GrossPrincipalLoss, LP_NetPrincipalLoss, and LP_NonPrincipalRecoverypayments (in fact, the last 3 have only zero values assigned). These I will take a closer look at.
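The same comparison can be made more directly than by reading a full summary. The sketch below, on made-up rows, tabulates the share of missing values and the share of zeros for each measure, separately for open and closed loans. The helper name share_table and the toy values are assumptions for illustration.

```r
# Made-up rows: three open loans (no ClosedDate) and two closed loans.
loans <- data.frame(
  ClosedDate = as.Date(c(NA, NA, NA, "2009-08-14", "2012-01-05")),
  EstimatedReturn = c(0.09, 0.07, 0.11, NA, 0.05),
  LP_GrossPrincipalLoss = c(0, 0, 0, 1200, 0)
)

is_open <- is.na(loans$ClosedDate)
measures <- c("EstimatedReturn", "LP_GrossPrincipalLoss")

# For each measure, the mean of a TRUE/FALSE flag within closed and open loans.
share_table <- function(flag_fun) {
  out <- sapply(measures, function(m) tapply(flag_fun(loans[[m]]), is_open, mean))
  rownames(out) <- c("closed", "open")  # tapply orders the groups FALSE, TRUE
  out
}

share_missing <- share_table(is.na)
share_zero <- share_table(function(x) !is.na(x) & x == 0)

print(share_missing)
print(share_zero)
```

A measure that is fully populated and non-zero for open loans is more plausibly a prediction, while one that is all zeros for open loans more plausibly records a realized outcome.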
Here, I quickly want to look at how well-correlated the numerical factors associated with profit are, to see if I need to look at all of them when seeing how predictive demographic factors are of profit. I expect, in any case, that the most productive factors to look at are EstimatedEffectiveYield (an overall view of how much lenders profit), EstimatedLoss (as this separately looks at principal loss), and LoanCurrentDaysDelinquent (as delinquency, even if the loan is ultimately paid, is likely of interest to lenders).
profit <- c("LenderYield", "EstimatedEffectiveYield", "EstimatedLoss", "LoanCurrentDaysDelinquent", "LP_GrossPrincipalLoss", "LP_NetPrincipalLoss","ProsperRating.num")
library(ggcorrplot)
corr <- cor(data[profit], use = "complete.obs")
head(corr[, 1:6])
## LenderYield EstimatedEffectiveYield
## LenderYield 1.0000000 0.8953425
## EstimatedEffectiveYield 0.8953425 1.0000000
## EstimatedLoss 0.9453084 0.7981346
## LoanCurrentDaysDelinquent 0.2157334 0.1342877
## LP_GrossPrincipalLoss 0.1362828 0.1394210
## LP_NetPrincipalLoss 0.1347704 0.1387118
## EstimatedLoss LoanCurrentDaysDelinquent
## LenderYield 0.94530836 0.2157334
## EstimatedEffectiveYield 0.79813456 0.1342877
## EstimatedLoss 1.00000000 0.1953217
## LoanCurrentDaysDelinquent 0.19532174 1.0000000
## LP_GrossPrincipalLoss 0.09333752 0.6034317
## LP_NetPrincipalLoss 0.09205293 0.6049597
## LP_GrossPrincipalLoss LP_NetPrincipalLoss
## LenderYield 0.13628281 0.13477041
## EstimatedEffectiveYield 0.13942102 0.13871181
## EstimatedLoss 0.09333752 0.09205293
## LoanCurrentDaysDelinquent 0.60343173 0.60495970
## LP_GrossPrincipalLoss 1.00000000 0.99330220
## LP_NetPrincipalLoss 0.99330220 1.00000000
ggcorrplot(corr, hc.order = TRUE, type = "lower",
outline.col = "white", lab = TRUE)
It turns out that most of the potential profit measures are not that well-correlated. Several, however, are well-correlated with each other (positively or negatively), and I expect these to be better representations of profit: if a potential measure of profit correlates with no other potential measure of profit, then it is unlikely to represent profit well, unless it is the single accurate measure of profit in the bunch, which is unlikely.
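One way to make this reading of the correlation matrix explicit is to rank each measure by its mean absolute correlation with all the other measures. The sketch below does this on simulated data; the simulated columns and the name mean_abs_corr are assumptions for illustration and the numbers say nothing about the Prosper data.

```r
# Simulated data: three measures driven by a shared 'risk' variable and one
# unrelated measure (an assumption chosen only to illustrate the ranking).
set.seed(1)
n <- 200
risk <- runif(n)
toy <- data.frame(
  EstimatedLoss = risk + rnorm(n, sd = 0.05),
  LenderYield = 2 * risk + rnorm(n, sd = 0.05),
  EstimatedEffectiveYield = 1.5 * risk + rnorm(n, sd = 0.1),
  Unrelated = rnorm(n)
)

corr <- cor(toy, use = "complete.obs")

# Ignore the diagonal (each measure correlates perfectly with itself) and
# average the absolute correlations in each row.
abs_corr <- abs(corr)
diag(abs_corr) <- NA
mean_abs_corr <- sort(rowMeans(abs_corr, na.rm = TRUE), decreasing = TRUE)
print(round(mean_abs_corr, 2))
```

Measures at the bottom of this ranking are the ones that share little with the rest and are candidates to set aside or examine separately.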
The profit measures showing the highest correlations with other measures are the following: EstimatedReturn, EstimatedEffectiveYield, EstimatedLoss, LenderYield, and ProsperRating. It’s likely that the other measures are informative for other, more specific, questions, but at a first glance, it makes sense to look at the most obvious measures of profit. It’s also possible that LoanStatus, previously looked at, is also informative, but it has a more indirect relationship to profit (especially given that, as a category, it is inherently in flux).
It is possible that measures of delinquency (LoanCurrentDaysDelinquent, OnTimeProsperPayments, CurrentDelinquencies, and AmountDelinquent) would affect lenders’ willingness to engage with clients regardless of ultimate gain or loss, particularly for lenders who rely on a regular ‘income.’ In this case, it would also be worth looking at how much the various demographic predictors correlate with these measures, particularly as they do not seem to be reflected by the Prosper rating (or other measures).
It is not clear to me exactly what LP_GrossPrincipalLoss and LP_NetPrincipalLoss mean, so I will not look at them for now. In addition, both seem reasonably well-correlated with one of the delinquency measures.
EstimatedReturn vs. EstimatedEffectiveYield
ggplot(data, aes(x = EstimatedReturn, y = EstimatedEffectiveYield)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
stat_smooth(n=2000) +
labs(title = "EstimatedReturn by EstimatedEffectiveYield") +
ylim(-0.5,0.5)
## Warning: Removed 29084 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 29084 rows containing non-finite values (stat_smooth).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 437 rows containing missing values (geom_pointrange).
EstimatedReturn vs. EstimatedLoss
ggplot(filter(data, !is.na(EstimatedReturn) & !is.na(EstimatedLoss)), aes(x = EstimatedReturn, y = EstimatedLoss)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "EstimatedReturn by EstimatedLoss") +
ylim(0,0.5)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 568 rows containing missing values (geom_pointrange).
EstimatedReturn vs. LenderYield
ggplot(data, aes(x = EstimatedReturn, y = LenderYield)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "EstimatedReturn by LenderYield") +
ylim(-0.1,0.5)
## Warning: Removed 29084 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 29084 rows containing non-finite values (stat_smooth).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 497 rows containing missing values (geom_pointrange).
EstimatedEffectiveYield vs. EstimatedLoss
ggplot(data, aes(x = EstimatedEffectiveYield, y = EstimatedLoss)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "EstimatedEffectiveYield by EstimatedLoss") +
ylim(0,0.5)
## Warning: Removed 29084 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 29084 rows containing non-finite values (stat_smooth).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 638 rows containing missing values (geom_pointrange).
EstimatedEffectiveYield vs. LenderYield
ggplot(data, aes(x = EstimatedEffectiveYield, y = LenderYield)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "EstimatedEffectiveYield by LenderYield") +
ylim(-0.1,0.5)
## Warning: Removed 29084 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 29084 rows containing non-finite values (stat_smooth).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 552 rows containing missing values (geom_pointrange).
EstimatedLoss vs. LenderYield
ggplot(data, aes(x = EstimatedLoss, y = LenderYield)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "EstimatedLoss by LenderYield") +
ylim(-0.1,0.5)
## Warning: Removed 29084 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 29084 rows containing non-finite values (stat_smooth).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_pointrange).
This is one of the more interesting graphs - it shows clearly that while lender yield increases with estimated loss, at higher levels of loss, the yield ceases to increase, and levels off, or even drops slightly towards higher levels of loss.
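A levelling-off like this can also be checked numerically by binning the x variable and comparing the mean of y from one bin to the next. The sketch below uses made-up data with a built-in cap on yield; the cap, the bin width, and the names LossBin, bin_means, and increments are assumptions for illustration.

```r
# Made-up data: yield rises with estimated loss, then is capped (an assumption
# chosen only to mimic a plateau; it is not the Prosper data).
toy <- data.frame(EstimatedLoss = (1:30) / 100)
toy$LenderYield <- pmin(0.05 + 1.5 * toy$EstimatedLoss, 0.32)

# Bin the x variable into equal-width intervals and average y within each bin.
toy$LossBin <- cut(toy$EstimatedLoss, breaks = seq(0, 30, by = 5) / 100)
bin_means <- vapply(split(toy$LenderYield, toy$LossBin), mean, numeric(1))

# Change in mean yield from one bin to the next.
increments <- diff(bin_means)
print(round(bin_means, 3))
print(round(increments, 3))
```

Increments that shrink towards zero (or turn negative) in the upper bins correspond to the flattening seen in the smoothed curve.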
If credit grades are only assigned once the fate of the loan is known, it may be more useful to look at how predictive pre-existing factors such as Occupation are of profit measures.
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","Occupation")
plot_data <- data[subset] %>%
group_by(Occupation) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
mutate(Occupation = reorder(Occupation, ProsperRating.num, mean)) %>%
gather(Measure, Value, -Occupation)
ggplot(plot_data, aes(x = Occupation, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by Occupation") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There are too many occupations to make easy generalizations. Occupations would likely need to be grouped into a smaller number of categories. However, one can observe general trends - those with higher-paying occupations (or occupational prospects) seem to be more profitable customers. On the other hand, students in general are among the bank’s least profitable customers. This suggests that income, which is grouped in a more sensible manner, may be useful to look at.
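A minimal sketch of one way to reduce the number of occupation categories: merge all student levels into one, then lump any level with few observations into a catch-all. The toy values, the "Student" prefix rule, the min_n threshold, and the "Other (rare)" label are assumptions for illustration.

```r
# Toy occupation factor (made-up counts).
occupation <- factor(c(
  rep("Professional", 5), rep("Teacher", 4),
  "Student - College Senior", "Student - College Freshman", "Judge"
))

# Step 1: merge every level starting with "Student" into a single level.
# Assigning duplicated level names merges the corresponding levels.
lvl <- levels(occupation)
lvl[grepl("^Student", lvl)] <- "Student"
levels(occupation) <- lvl

# Step 2: lump levels with fewer than min_n observations into one category.
min_n <- 2
counts <- table(occupation)
rare <- names(counts)[counts < min_n]
lvl <- levels(occupation)
lvl[lvl %in% rare] <- "Other (rare)"
levels(occupation) <- lvl

print(table(occupation))
```

On the real data, the threshold would need to be much larger, and grouping by sector or typical income would probably be more informative than grouping by frequency alone.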
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","IncomeRange")
plot_data <- data[subset] %>%
group_by(IncomeRange) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IncomeRange)
ggplot(plot_data, aes(x = IncomeRange, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by IncomeRange") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 4 rows containing missing values (position_stack).
Here, it can be seen that as income rises, the ProsperRating increases, and other measures of profit decrease. As we have seen, ProsperRating correlates with credit score and likelihood of not defaulting. This suggests that high-income borrowers are lower-risk, but lower-income borrowers, while being higher-risk, can also yield more profit.
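Reading a trend across income ranges is easier when the factor levels are in income order; by default they are sorted alphabetically rather than by amount. The sketch below sets an explicit order. The level list covers only the ranges visible in the summary output above, and the name income_order is an assumption for illustration.

```r
# Toy income factor using level names that appear in the summary output.
income <- factor(c("$100,000+", "$1-24,999", "$50,000-74,999",
                   "Not employed", "$25,000-49,999", "$75,000-99,999"))

# Explicit low-to-high order. Any value not listed here would become NA,
# so the list must cover every level present in the real column.
income_order <- c("Not employed", "$1-24,999", "$25,000-49,999",
                  "$50,000-74,999", "$75,000-99,999", "$100,000+")
income_ordered <- factor(income, levels = income_order, ordered = TRUE)

print(levels(income_ordered))
```

With the levels ordered this way, the bars in a plot by IncomeRange run from lowest to highest income.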
Here, I will look at the relationship between other predictors and profit measures.
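Since the same summarize-and-reshape steps are repeated for each predictor, they could be wrapped in a small helper. The sketch below is one possible version, assuming dplyr 1.0 or later and tidyr 1.0 or later; the function name mean_by_group and the toy data are hypothetical, and across() is used in place of funs().

```r
library(dplyr)
library(tidyr)

# Mean of each measure within each level of one grouping column,
# returned in long format (one row per group and measure).
mean_by_group <- function(df, group_var, measures) {
  df %>%
    select(all_of(c(group_var, measures))) %>%
    group_by(across(all_of(group_var))) %>%
    summarise(across(all_of(measures), ~ mean(.x, na.rm = TRUE)),
              .groups = "drop") %>%
    pivot_longer(all_of(measures), names_to = "Measure", values_to = "Value")
}

# Toy data (made-up values).
toy <- data.frame(
  EmploymentStatus = c("Employed", "Employed", "Retired"),
  EstimatedLoss = c(0.1, 0.3, NA),
  LenderYield = c(0.2, 0.4, 0.1)
)

res <- mean_by_group(toy, "EmploymentStatus", c("EstimatedLoss", "LenderYield"))
print(res)
```

The long-format result has the same shape as plot_data in the blocks that follow, so it can be passed to the same ggplot calls.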
EmploymentStatus
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","EmploymentStatus")
plot_data <- data[subset] %>%
group_by(EmploymentStatus) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -EmploymentStatus)
ggplot(plot_data, aes(x = EmploymentStatus, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by EmploymentStatus") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 8 rows containing missing values (position_stack).
What it looks like here is that Prosper ratings are highest for those employed, and employed full-time (it’s not clear what the difference is), lower for those who are self-employed, retired, work part-time, or ‘other,’ and much lower for those not employed. LenderYield, EstimatedEffectiveYield, and EstimatedReturn, however, are highest for those not employed, likely reflecting the higher interest charged to people in that group. EstimatedLoss, correspondingly, is also highest for those not employed: there’s higher potential profit if the loans are paid back, but also significantly more risk.
EmploymentStatusDuration
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","EmploymentStatusDuration")
plot_data <- data[subset] %>%
group_by(EmploymentStatusDuration) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -EmploymentStatusDuration)
ggplot(plot_data, aes(x = EmploymentStatusDuration, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by EmploymentStatusDuration") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 13 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 3017 rows containing missing values (geom_pointrange).
IsBorrowerHomeowner
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","IsBorrowerHomeowner")
plot_data <- data[subset] %>%
group_by(IsBorrowerHomeowner) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IsBorrowerHomeowner)
ggplot(plot_data, aes(x = IsBorrowerHomeowner, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by IsBorrowerHomeowner") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
FirstRecordedCreditLine
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","FirstRecordedCreditLine")
plot_data <- data[subset] %>%
group_by(FirstRecordedCreditLine) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -FirstRecordedCreditLine)
ggplot(plot_data, aes(x = FirstRecordedCreditLine, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by FirstRecordedCreditLine") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 1961 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1961 rows containing non-finite values (stat_smooth).
## Warning: Removed 1961 rows containing missing values (geom_point).
## Warning: Removed 55969 rows containing missing values (geom_pointrange).
OpenRevolvingAccounts
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","OpenRevolvingAccounts")
plot_data <- data[subset] %>%
group_by(OpenRevolvingAccounts) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -OpenRevolvingAccounts)
ggplot(plot_data, aes(x = OpenRevolvingAccounts, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by OpenRevolvingAccounts") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 8 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 232 rows containing missing values (geom_pointrange).
InquiriesLast6Months
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","InquiriesLast6Months")
plot_data <- data[subset] %>%
group_by(InquiriesLast6Months) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -InquiriesLast6Months)
ggplot(plot_data, aes(x = InquiriesLast6Months, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by InquiriesLast6Months") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 113 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 113 rows containing non-finite values (stat_smooth).
## Warning: Removed 113 rows containing missing values (geom_point).
## Warning: Removed 142 rows containing missing values (geom_pointrange).
RevolvingCreditBalance
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","RevolvingCreditBalance")
plot_data <- data[subset] %>%
group_by(RevolvingCreditBalance) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -RevolvingCreditBalance)
ggplot(plot_data, aes(x = RevolvingCreditBalance, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by RevolvingCreditBalance") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 13509 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 13509 rows containing non-finite values (stat_smooth).
## Warning: Removed 13509 rows containing missing values (geom_point).
## Warning: Removed 179271 rows containing missing values (geom_pointrange).
BankcardUtilization
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","BankcardUtilization")
plot_data <- data[subset] %>%
group_by(BankcardUtilization) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -BankcardUtilization)
ggplot(plot_data, aes(x = BankcardUtilization, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by BankcardUtilization") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 237 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 237 rows containing non-finite values (stat_smooth).
## Warning: Removed 237 rows containing missing values (geom_point).
## Warning: Removed 773 rows containing missing values (geom_pointrange).
TotalTrades
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","TotalTrades")
plot_data <- data[subset] %>%
group_by(TotalTrades) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -TotalTrades)
ggplot(plot_data, aes(x = TotalTrades, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by TotalTrades") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 21 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 21 rows containing non-finite values (stat_smooth).
## Warning: Removed 21 rows containing missing values (geom_point).
## Warning: Removed 524 rows containing missing values (geom_pointrange).
DebtToIncomeRatio
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","DebtToIncomeRatio")
plot_data <- data[subset] %>%
group_by(DebtToIncomeRatio) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -DebtToIncomeRatio)
ggplot(plot_data, aes(x = DebtToIncomeRatio, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Profit by DebtToIncomeRatio") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 3797 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 3797 rows containing non-finite values (stat_smooth).
## Warning: Removed 3797 rows containing missing values (geom_point).
## Warning: Removed 2243 rows containing missing values (geom_pointrange).
IncomeVerifiable
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","IncomeVerifiable")
plot_data <- data[subset] %>%
group_by(IncomeVerifiable) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IncomeVerifiable)
ggplot(plot_data, aes(x = IncomeVerifiable, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by IncomeVerifiable") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
TotalProsperLoans
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","TotalProsperLoans")
plot_data <- data[subset] %>%
group_by(TotalProsperLoans) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -TotalProsperLoans)
ggplot(plot_data, aes(x = TotalProsperLoans, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by TotalProsperLoans") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 5 rows containing missing values (position_stack).
LoanOriginationQuarter
subset <- c("EstimatedReturn", "EstimatedEffectiveYield", "EstimatedLoss", "LenderYield", "ProsperRating.num","LoanOriginationQuarter")
plot_data <- data[subset] %>%
group_by(LoanOriginationQuarter) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -LoanOriginationQuarter)
ggplot(plot_data, aes(x = LoanOriginationQuarter, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Profit by LoanOriginationQuarter") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 56 rows containing missing values (position_stack).
Without knowing how the loans ultimately panned out, it is a bit difficult to use this data to make future predictions.
First, I want to see which Prosper ratings are associated with delinquency.
Occupation
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","Occupation")
plot_data <- data[subset] %>%
group_by(Occupation) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
mutate(Occupation = reorder(Occupation, OnTimeProsperPayments, mean)) %>%
gather(Measure, Value, -Occupation)
ggplot(plot_data, aes(x = Occupation, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by Occupation") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 1 rows containing missing values (position_stack).
There are too many occupations to make easy generalizations. Occupations would likely need to be grouped into a smaller number of categories. However, one can observe general trends: those with higher-paying occupations (or occupational prospects) seem to be more profitable customers, while students in general are among the bank’s least profitable customers. This suggests that income, which is grouped in a more sensible manner, may be useful to look at.
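One way to shrink the number of occupation categories is to keep the most frequent ones and pool the rest. The sketch below assumes the forcats function `fct_lump_n()` (forcats 0.5.0 or later). The toy labels and the choice of `n = 2` are made up for illustration, and a hand-built mapping into broader job families may well be more meaningful than lumping by frequency.

```r
library(forcats)

# Made-up occupation labels, only for illustration.
occupation <- factor(c("Teacher", "Teacher", "Teacher", "Nurse", "Nurse",
                       "Pilot", "Chemist", "Judge"))

# Keep the 2 most frequent levels and pool everything else into "Other".
occupation_lumped <- fct_lump_n(occupation, n = 2, other_level = "Other")
table(occupation_lumped)
```

The lumped factor can then be used in place of `Occupation` in the grouping step.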
IncomeRange
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","IncomeRange")
plot_data <- data[subset] %>%
group_by(IncomeRange) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IncomeRange)
ggplot(plot_data, aes(x = IncomeRange, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by IncomeRange") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 1 rows containing missing values (position_stack).
Here, it can be seen that as income rises, the ProsperRating increases and the other measures of profit decrease. As we have seen, ProsperRating correlates with credit score and with the likelihood of not defaulting. This suggests that high-income borrowers are lower-risk, while lower-income borrowers, though higher-risk, can also yield more profit.
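Reading a trend "as income rises" off a bar chart is easier when the income brackets are plotted in order rather than alphabetically. A minimal sketch in base R follows. The bracket labels are an assumption about what `IncomeRange` contains, so they should be checked against `levels(data$IncomeRange)` before use, and the toy vector stands in for the real column.

```r
# Assumed bracket labels (an assumption; verify with levels(data$IncomeRange)).
income_levels <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                   "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                   "$100,000+")

# Toy values standing in for data$IncomeRange.
income <- c("$100,000+", "$1-24,999", "$50,000-74,999", "Not employed")

# An ordered factor keeps this order in grouped summaries and on plot axes.
income_ordered <- factor(income, levels = income_levels, ordered = TRUE)
sort(income_ordered)
```

Any label missing from `income_levels` would turn into `NA`, which is a quick way to spot a mismatch with the real levels.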
EmploymentStatus
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","EmploymentStatus")
plot_data <- data[subset] %>%
group_by(EmploymentStatus) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -EmploymentStatus)
ggplot(plot_data, aes(x = EmploymentStatus, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by EmploymentStatus") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 4 rows containing missing values (position_stack).
Prosper ratings appear to be highest for those listed as employed or employed full-time (it’s not clear what the difference between the two is), lower for those who are self-employed, retired, work part-time, or ‘other,’ and much lower for those not employed. LenderYield, EstimatedEffectiveYield, and EstimatedReturn, however, are highest for those not employed, likely reflecting the higher anticipated interest charged to people in that group. EstimatedLoss, correspondingly, is also highest for those not employed: there is higher potential profit if the loans are paid back, but also significantly more risk.
EmploymentStatusDuration
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","EmploymentStatusDuration")
plot_data <- data[subset] %>%
group_by(EmploymentStatusDuration) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -EmploymentStatusDuration)
ggplot(plot_data, aes(x = EmploymentStatusDuration, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by EmploymentStatusDuration") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 102 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 102 rows containing non-finite values (stat_smooth).
## Warning: Removed 102 rows containing missing values (geom_point).
## Warning: Removed 2322 rows containing missing values (geom_pointrange).
IsBorrowerHomeowner
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","IsBorrowerHomeowner")
plot_data <- data[subset] %>%
group_by(IsBorrowerHomeowner) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IsBorrowerHomeowner)
ggplot(plot_data, aes(x = IsBorrowerHomeowner, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by IsBorrowerHomeowner") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
FirstRecordedCreditLine
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","FirstRecordedCreditLine")
plot_data <- data[subset] %>%
group_by(FirstRecordedCreditLine) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -FirstRecordedCreditLine)
ggplot(plot_data, aes(x = FirstRecordedCreditLine, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by FirstRecordedCreditLine") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 5117 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 5117 rows containing non-finite values (stat_smooth).
## Warning: Removed 5117 rows containing missing values (geom_point).
## Warning: Removed 41227 rows containing missing values (geom_pointrange).
OpenRevolvingAccounts
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","OpenRevolvingAccounts")
plot_data <- data[subset] %>%
group_by(OpenRevolvingAccounts) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -OpenRevolvingAccounts)
ggplot(plot_data, aes(x = OpenRevolvingAccounts, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by OpenRevolvingAccounts") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 8 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 184 rows containing missing values (geom_pointrange).
InquiriesLast6Months
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","InquiriesLast6Months")
plot_data <- data[subset] %>%
group_by(InquiriesLast6Months) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -InquiriesLast6Months)
ggplot(plot_data, aes(x = InquiriesLast6Months, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by InquiriesLast6Months") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 30 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 30 rows containing non-finite values (stat_smooth).
## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 174 rows containing missing values (geom_pointrange).
RevolvingCreditBalance
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","RevolvingCreditBalance")
plot_data <- data[subset] %>%
group_by(RevolvingCreditBalance) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -RevolvingCreditBalance)
ggplot(plot_data, aes(x = RevolvingCreditBalance, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by RevolvingCreditBalance") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 23589 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 23589 rows containing non-finite values (stat_smooth).
## Warning: Removed 23589 rows containing missing values (geom_point).
## Warning: Removed 130635 rows containing missing values (geom_pointrange).
BankcardUtilization
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","BankcardUtilization")
plot_data <- data[subset] %>%
group_by(BankcardUtilization) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -BankcardUtilization)
ggplot(plot_data, aes(x = BankcardUtilization, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by BankcardUtilization") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 55 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 55 rows containing non-finite values (stat_smooth).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 753 rows containing missing values (geom_pointrange).
TotalTrades
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","TotalTrades")
plot_data <- data[subset] %>%
group_by(TotalTrades) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -TotalTrades)
ggplot(plot_data, aes(x = TotalTrades, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by TotalTrades") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 12 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 424 rows containing missing values (geom_pointrange).
DebtToIncomeRatio
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","DebtToIncomeRatio")
plot_data <- data[subset] %>%
group_by(DebtToIncomeRatio) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -DebtToIncomeRatio)
ggplot(plot_data, aes(x = DebtToIncomeRatio, y=Value)) +
geom_point() +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth(formula = y~x) +
labs(title = "Delinquency by DebtToIncomeRatio") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 2506 rows containing non-finite values (stat_summary).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 2506 rows containing non-finite values (stat_smooth).
## Warning: Removed 2506 rows containing missing values (geom_point).
## Warning: Removed 2326 rows containing missing values (geom_pointrange).
IncomeVerifiable
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","IncomeVerifiable")
plot_data <- data[subset] %>%
group_by(IncomeVerifiable) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -IncomeVerifiable)
ggplot(plot_data, aes(x = IncomeVerifiable, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by IncomeVerifiable") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
TotalProsperLoans
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","TotalProsperLoans")
plot_data <- data[subset] %>%
group_by(TotalProsperLoans) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -TotalProsperLoans)
ggplot(plot_data, aes(x = TotalProsperLoans, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by TotalProsperLoans") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 4 rows containing missing values (position_stack).
LoanOriginationQuarter
subset <- c("AmountDelinquent", "CurrentDelinquencies", "OnTimeProsperPayments", "LoanCurrentDaysDelinquent","LoanOriginationQuarter")
plot_data <- data[subset] %>%
group_by(LoanOriginationQuarter) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(Measure, Value, -LoanOriginationQuarter)
ggplot(plot_data, aes(x = LoanOriginationQuarter, y=Value)) +
geom_bar(stat = "identity") +
labs(title = "Delinquency by LoanOriginationQuarter") +
facet_grid(~Measure, scales="free") +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 14 rows containing missing values (position_stack).
Ultimately, without knowing how the loans panned out, it is a bit difficult to use this data to make predictions.
Three things are notable above. First, in each graph where there is a noticeable relationship between profit predictors and profit measures, the Prosper rating is inversely correlated with the profit measures. Second, there is a consistent relationship between lender yield and lender loss: the more the lender stands to gain, the more they stand to lose. I look at this in more detail below. Third, the estimated effective yield is always a bit less than the estimated yield, a gap that also reflects the estimated loss.
Assuming that the various profit measures, which may reflect only profit for clients/lenders rather than for the company itself, are in fact what we want to be looking at, certain trends stand out that may be worth looking at more closely.
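The relationship between yield and loss, and the gap between effective yield and yield, can be checked numerically rather than read off the plots. A rough sketch in base R follows, using invented rows in place of `data` so that it runs on its own. The column names are the ones used in the analysis, while comparing against `LenderYield` specifically is an assumption about which yield is meant.

```r
# Invented values, standing in for the corresponding columns of `data`.
toy <- data.frame(
  LenderYield             = c(0.08, 0.12, 0.18, 0.25, 0.30),
  EstimatedLoss           = c(0.01, 0.03, 0.06, 0.10, 0.15),
  EstimatedEffectiveYield = c(0.07, 0.11, 0.16, 0.22, 0.26)
)

# Rank correlation between what lenders stand to gain and what they stand to lose.
yield_loss_cor <- cor(toy$LenderYield, toy$EstimatedLoss,
                      method = "spearman", use = "complete.obs")

# Share of loans whose effective yield falls below the lender yield.
share_below <- mean(toy$EstimatedEffectiveYield < toy$LenderYield, na.rm = TRUE)
```

On the real data, a correlation near 1 and a share near 1 would support the two observations. Anything much lower would mean the pattern only holds for group averages.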
Further, assuming that lenders also care about potential missed payments, particularly if this would put them in a financial bind, it is worth looking at strong demographic predictors of delinquency, which does not appear to be reflected in the Prosper rating.
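Whether delinquency is reflected in the Prosper rating can be checked by putting the two side by side for each group. A minimal sketch in base R, with invented rows standing in for `data`. `EmploymentStatus` is used as the grouping variable only as an example, and the values are made up so that the snippet is self-contained.

```r
# Invented rows; the column names follow the ones used in the analysis.
toy <- data.frame(
  EmploymentStatus     = c("Employed", "Employed", "Not employed", "Not employed"),
  CurrentDelinquencies = c(0, 1, 2, 4),
  ProsperRating.num    = c(5, 4, 4, 5)
)

# Group means of delinquency and rating in one table.
# Note: the formula interface drops rows with NA in any of the listed columns.
by_group <- aggregate(cbind(CurrentDelinquencies, ProsperRating.num) ~ EmploymentStatus,
                      data = toy, FUN = mean)
by_group
```

Groups whose mean delinquency differs a lot while their mean rating stays about the same would be the ones where the rating misses something.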
First, I encountered a fair bit of trouble interpreting the data without any background story. Searching online for information about Prosper loans gave me a general idea of what the company does, which made interpreting the data somewhat easier.
At this point, it is still difficult to say much about this data without knowing, in a lot more detail: what realities the less obvious measures reflect; the story behind the data; how certain measures are gathered and determined; and how the various measures reflect on both profit for the company and profit for the clients. What is needed is a much closer, more in-depth look at what the company does, what purpose the data serves, how the measures were collected, and what they reflect.